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Abstract 


Paired  comparison  experiments  involve  comparisons  among  a  set  of  objects,  treatments, 
or  competitors  in  blocks  of  size  two  to  determine  relative  merits  among  the  objects. 
Applications  to  tournament  chess  competition  or  team  sports  differ  from  the  usual  paired 
comparison  framework  in  that  competitors’  abilities  may  change  over  time.  A  dynamic 
model  is  developed  for  the  usual  paired  comparison  experiment  in  which  only  a  binary 
indicator  of  the  preferred  object  is  available  (or  trinary,  if  ties  are  allowed),  and  for 
alternative  paired  comparison  experiments  in  which  more  information  (perhaps  the  score) 
is  available.  Posterior  distributions  of  the  rating  parameters  and  posterior  prediction 
intervals  for  the  results  of  future  comparisons  are  obtained  using  methods  that  are  closely 
related  to  the  Kalman  filter.  The  methodology  is  applied  to  the  analysis  of  professional 
football  game  outcomes  and  to  chess  tournament  results  from  the  1988-1989  World  Cup. 
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Chapter  1 


Introduction 


Accurate  estimation  of  a  tournament  chess  competitor’s  playing  strength  is  an  important 
goal  of  tournament  chess  organizers  like  the  U.  S.  Chess  Federation  (USCF),  and  is  also 
of  interest  to  players,  as  they  would  like  to  have  their  strength  quantified.  Suppose  the 
results  of  tournament  chess  games  over  many  years  are  observed.  One  approach  to  rating 
chess  players’  strengths  is  to  use  the  results  from  the  most  recent  competitions  only  and 
rank  players  based  on  these  game  outcomes.  This  may  not  be  the  most  desirable  approach 
because  it  ignores  potentially  useful  information  from  earlier  competitions.  Another  ap¬ 
proach  is  to  combine  all  the  data  and  rank  players  based  on  the  aggregated  data.  This 
seems  unsatisfactory  as  well  because  previous  competition  data  are  given  as  much  weight 
as  current  data  in  ranking  players.  The  best  approach  seems  to  be  a  compromise  between 
these  two  methods,  so  that  earlier  competition  data  is  given  some  weight  in  estimating 
current  ability,  but  not  as  much  as  the  most  recent  competition  data.  Ideally,  the  degree  of 
compromise  should  be  determined  from  the  data  itself.  This  thesis  formalizes  the  degree 
of  compromise,  and  explores  an  approach  to  estimating  the  abilities  of  competitors  from 
game  outcomes  observed  over  time  using  dynamic  paired  comparison  models. 

Paired  comparison  models  are  used  to  analyze  data  from  experiments  in  which  a  set 
of  p  objects  or  competitors  are  compared  in  blocks  of  size  2.  Applications  of  paired  com¬ 
parison  models  include  studies  in  choice  behavior,  preference  testing,  sports,  and  sensory 
discrimination.  The  inferential  goals  of  paired  comparison  analyses  include  ranking  ob¬ 
jects  based  on  the  results  of  comparisons,  and  predicting  the  results  of  future  comparisons. 
Most  work  on  paired  comparison  modeling  assumes  the  outcome  is  a  binary  variable  in¬ 
dicating  which  of  two  objects  is  preferred.  This  thesis  examines  binary  outcome  models, 
but  also  examines  models  for  comparisons  with  a  continuous  outcome  measure.  Two  of 
the  most  commonly  used  paired  comparison  models  are  the  Bradley-Terry  model  (Bradley 
and  Terry  1952)  and  the  Thurstone-Mosteller  model  (Mosteller  1951;  Thurstone  1927). 
David  (1988)  provides  a  thorough  introduction  and  an  extensive  bibliography  on  topics 
in  paired  comparison  modeling. 
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The  thesis  examines  paired  comparison  models  in  which  the  merits  or  abilities  of 
objects  being  compared  may  change  over  time.  Competitive  sports  and  games,  including 
chess,  football,  hockey,  and  basketball,  involve  teams  or  players  whose  abilities  change 
over  time  due  to  effects  such  as  player  trades,  coaching  staff  changes,  and  aging  of  players. 
The  Bradley-Terry  and  Thurstone-Mosteller  models  assume  that  paired  comparison  data 
occur  simultaneously  or  within  a  short  period  of  time,  or  that  objects’  merits  do  not 
change  over  time,  so  that  blind  use  of  these  models  is  not  appropriate.  The  models 
developed  here  assign  a  rating  parameter  to  each  object  or  competitor  which  describes 
the  object’s  strength  or  merit,  and  the  rating  parameter  may  change  over  time.  Methods 
are  developed  for  paired  comparisons  where  the  outcome  variable  is  r-  -her  a  difference 
in  measurements  between  the  two  objects,  or  where  the  outcome  is  a  binary  indicator  of 
which  object  is  preferred  (or  trinary,  if  ties  are  allowed). 

The  models  developed  in  this  thesis  have  strong  connections  to  the  discrete- time 
Kalman  Alter  (Kalman  1960;  Kalman  and  Bucy  1961),  and  to  Bayesian  dynamic  models 
(West  and  Harrison  1990).  Our  models  assume  an  “observation  equation,”  which  is  the 
probability  model  for  an  outcome  of  a  comparison  at  time  t,  and  a  “system  equation” 
which  describes  the  stochastic  evolution  of  parameters  Horn  time  t  to  time  t  +  1.  The 
foundational  assumptions  for  this  model  are  that  when  objects  are  compared  or  players 
compete,  information  is  obtained  about  their  rating  parameters,  but  between  competi¬ 
tions,  changes  may  occur  that  affect  the  merits  of  the  objects  or  the  abilities  of  players, 
so  that  the  distribution  of  the  rating  parameters  becomes  less  certain  due  to  the  passage 
of  time  without  comparisons.  This  uncertainty  is  reflected  in  a  greater  variance  of  the 
posterior  distribution  of  the  rating  parameters. 

The  system  variance,  describing  the  magnitude  of  the  parameter  changes  over  time, 
is  an  unknown  parameter,  and  this  poses  some  difficulties  in  the  analyses  of  the  dynamic 
paired  comparison  models.  To  facilitate  the  analyses,  the  posterior  distribution  of  the 
system  variance  parameter  is  approximated  by  a  discrete  distribution.  Two  methods  for 
carrying  out  the  approximation  are  examined.  In  one  analysis,  iterative  simulation  via  the 
Gibbs  sampler  (Genian  and  Geman  1984)  is  employed  to  obtain  samples  from  the  posterior 
distribution  of  parameters  of  interest.  A  second  approach  involves  approximating  the  prior 
distribution  of  the  system  variance  by  a  discrete  distribution,  and  performing  a  conditional 
analysis  for  each  value  of  the  system  variance.  The  proposed  models  in  this  thesis  differ 
in  an  important  respect  from  the  usual  dynamic  models  of  West  and  Harrison  (1990)  in 
the  assumption  that  the  system  variance  is  unknown  a  priori.  Previous  uses  of  models 
like  this  typically  assume  the  variance  is  known  or  can  be  well  estimated  in  advance,  in 
this  respect  the  models  developed  in  this  thesis  are  less  restrictive  than  the  usual  models. 
Placing  probability  distributions  on  all  rating  and  variance  parameters  allows  for  a  flexible 
approach  to  modeling  paired  comparison  data. 

While  the  thesis  specifically  focuses  on  paired  comparison  models,  the  approach  to 
analyzing  dynamic  models  has  a  wide  range  of  applicability.  For  example,  the  methods 
developed  for  - .  alyzing  paired  comparison  models  with  continuous  outcomes  can  be  gen¬ 
eralized  to  normal  linear  dynamic  models  of  West  and  Harrison  (1990),  so  that  these 
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more  general  models  can  be  utilized  without  assuming  the  system  variance  is  known.  Fur 
thermore,  methodology  developed  to  analyze  paired  comparison  models  with  indicator 
outcomes  can  be  applied  to  a  wide  class  of  non-linear  dynamic  models,  including  dynamic 
versions  of  generalized  linear  models  (McCullagh  and  Nelder  1989).  Thus,  the  combina 
tion  of  dynamic  probability  models  and  Bayesian  calculations  that  are  feasible  in  modern 
computing  environments  offers  a  powerful  tool  in  analyzing  experimental  time  series  data. 

This  paper  is  organized  into  two  main  parts.  The  first  part,  which  includes  Chapters  2 
and  3,  discusses  paired  comparison  models  with  a  normally  distributed  outcome  variable, 
while  the  second  part,  which  includes  Chapters  4  and  5,  considers  paired  comparison  mod 
els  with  indicator  outcomes.  In  particular,  Chapter  2  develops  the  methodology  for  paired 
comparison  models  where  the  outcome  variable  is  the  difference  between  measurements  or 
scores.  Chapter  3  applies  the  methodology  of  Chapter  2  to  the  analysis  of  NFL  football 
game  results.  Chapter  4  examines  the  Bradley-Terry  model  and  its  dynamic  extensions, 
and  discusses  the  methodology  to  analyze  data  using  such  models.  In  Chapter  5,  the 
methodology  of  Chapter  4  is  applied  to  analyzing  the  results  of  chess  tournament  games 
from  the  1988-9  World  Cup  Chess  events  and  some  simulated  data.  Chapter  6  concludes 
the  thesis  with  a  discussion  of  the  utility  of  the  models  and  possible  extensions  for  more 
complex  paired  comparison  experiments. 


Chapter  2 


Paired  Comparisons  with 
Measured  Outcomes 


In  this  chapter,  we  consider  several  models  for  paired  comparison  data  in  which  the 
outcome  variable  is  the  difference  between  measurements  of  the  objects  or  teams.  Our 
approach  assumes  that  the  measurements  or  scores  of  each  object  are  from  independent 
normal  distributions.  The  analysis  of  the  normal  model  «  described  in  Section  2.1,  first 
assuming  that  the  variances  from  the  normal  distributions  are  identical,  and  then  consid¬ 
ering  unrestricted  variances.  For  the  remainder  of  the  development  in  this  chapter,  we 
assume  that  the  variances  are  identical.  We  examine  the  Bayesian  analysis  cf  the  normal 
model  in  Section  2.2  using  a  conjugate  prior  distribution.  In  Section  2.3,  we  introduce  the 
dynamic  model  which  assumes  that  the  means  of  the  normally  distributed  measurements 
can  change  over  time.  The  magnitude  of  the  changes  is  reflected  by  a  variance  param¬ 
eter  of  the  dynamic  component  in  the  model.  We  develop  the  analysis  of  the  dynamic 
model  in  Sections  2.4  and  2.5.  Because  the  analysis  of  the  dynamic  model  is  intractable 
in  dosed  form,  we  examine  two  methods  to  facilitate  the  analysis.  Each  method  involves 
an  approximation  of  the  posterior  distribution  for  the  variance  parameter  of  the  dynamic 
component.  One  approach  uses  the  Gibbs  sampler  to  draw  a  random  sample  from  the 
marginal  posterior  distribution  of  the  variance  parameter.  A  second  approach,  incorpo¬ 
rating  a  non-iterative  technique,  approximates  the  continuous  prior  distribution  on  the 
variance  parameter  by  a  discrete  distribution.  Extending  the  model  to  allow  for  the  in¬ 
clusion  of  covariate  information  is  discussed  in  Section  2.6.  We  then  show  in  Section  2.7 
how  to  obtain  approximate  marginal  posterior  distributions  for  parameters  of  interest, 
and  describe  how  to  obtain  the  forecast  distribution.  Sequential  updating  of  the  posterior 
distribution  of  parameters  when  new  data  becomes  available  is  discussed  in  Section  2.8. 
For  developing  of  the  models,  we  use  language  that  assumes  paired  comparison  data  are 
scores  resulting  from  team  competitions,  but  the  models  are  understood  to  have  more 
general  applicability. 
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2.1  A  Normal  Model 


Suppose  a  set  of  p  teams  participate  in  a  series  of  competitions,  and  let  Suk  be  the 
random  variable  corresponding  to  the  score  of  the  fc-th  game  produced  by  team  ik-  We 
consider  models  appropriate  when  only  YikJkk  —  Slkh  —  can  be  observed,  or  when 
the  only  relevant  information  concerning  mean  team  scores  is  the  difference  in  the  scores. 
This  latter  situation  occurs,  for  example,  in  experiments  with  paired  designs.  The  only 
information  about  the  treatments  that  is  not  confounded  with  blocks  are  the  differences 
in  measurements  within  blocks.  We  assume  that  the  Sikk  come  from  independent  normal 
distributions.  We  consider  a  model  that  assumes  equal  variances  in  Section  2.1.1,  and 
then  examine  a  model  with  unrestricted  variances  in  Section  2.1.2. 


2.1.1  Equal  Variances 

The  normal  model  with  equal  variances  assumes  team  t  produces  scores  that  are  indepen¬ 
dently  distributed 

with  possibly  different  means,  but  equal  variances 

Suppose  a  total  of  n  comparisons  among  the  p  teams  are  observed.  As  before,  denote 
by  S,kk  the  score  produced  by  team  i*  on  the  fc-th  comparison,  fc  =  1, . . . ,  n,  where  it  can 
take  on  values  1, . . .  ,p.  We  have  under  the  assumptions  above,  given  6,k,  0]k  and  r2, 

Yikjkk  =  —  ^jkk  ~  —  0jt,2T2),  (2.1) 

where  Yikjkk  is  the  score  difference  of  the  fc-th  competition  between  teams  u  and  jk-  We 
assume  that  the  result  of  competitions  are  independent  from  each  other,  so  that  the 
are  independent  random  variables. 

This  model  for  score  differences  can  be  derived  from  a  slightly  different  set  of  assump¬ 
tions.  If  we  assume  that  for  the  fc-th  competition, 

=  ft*  +  a*  + 

S}kk  =  0jk  +  Qk  +  (jkki  (2.2) 

where  and  (jkk  are  normally  distributed  uncorrelated  random  variables  with  mean 
0  and  variance  r2,  then  we  obtain  the  model  in  (2.1)  above.  We  can  think  of  a*  as  a 
nuisance  parameter  particular  to  competition  fc.  For  instance,  the  weather  conditions  for 
a  competition  or  other  external  factors  may  affect  both  teams’  final  scores,  and  it  is  not 
unreasonable  to  assert  that  the  effect  will  be  equal  for  both  teams.  The  model  in  (2.2) 
is  common  for  incomplete  block  designs  with  blocks  of  size  2.  The  block  effects  are  not 
of  much  interest,  and  the  analysis  of  such  models  can  proceed  by  taking  differences  of 
observations  within  blocks  to  remove  the  block  effects. 
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The  model  can  be  represented  in  matrix  form  as  follows.  Let  y  be  the  vector  of  the 
n  score  differences,  and  let  0  be  the  vector  of  the  p  team  parameters  (0i, . . . ,  0P)T.  We 
introduce  an  n  x  p  design  matrix,  X.  Suppose  that  the  fc-th  element  of  y  is  the  score 
difference  between  teams  t*  and  jk-  Then  the  Jfc-th  row  of  X  is  given  by  =  1. 

Xk]k  =  -1,  and  Xu  =  0  for  t  ik  and  /  ^  We  can  now  restate  our  model  in  the 
more  compact  form 

(2.3) 

where  Li  is  the  n  x  n  identity  matrix. 

In  the  next  section,  we  propose  a  complete  probability  model  for  0  and  describe 
approaches  to  posterior  inference.  We  begin  our  discussion  of  the  model  by  considering 
maximum  likelihood  estimation.  It  is  straightforward  to  compute  maximum  likelihood 
estimates  (MLE’s)  for  9  as  long  as  the  singularities  in  the  design  matrix  X  are  removed. 
Because  the  data  only  supply  information  on  score  differences,  we  can  only  estimate  the 
8i  uniquely  up  to  an  additive  constant.  We  impose  a  single  constraint  to  identify  6.  One 
possibility  is  the  linear  constraint,  0,  =  0.  We  incorporate  this  into  the  design  matrix 
by  augmenting  X  with  the  row  (1,1,... ,1)  and  y  with  the  extra  observation  0  with  0 
variance.  Singularities  will  arise  if  teams  can  be  partitioned  into  two  subsets  in  which 
none  of  the  teams  in  one  subset  competes  against  any  team  in  the  other  subset.  In  such  a 
situation,  team  parameters  can  be  estimated  within  subsets,  but  not  between  subsets.  No 
information  is  provided  in  the  data  to  simultaneously  estimate  all  of  the  parameters.  This 
suggests  that  an  important  consideration  in  designing  a  paired  comparison  experiment 
is  to  prevent  such  a  situation  from  occurring.  Necessary  and  sufficient  conditions  are 
discussed  in  Ford  (1957). 

If  we  call  the  augmented  matrix  X  and  the  augmented  ore  difference  vector  y,  then 
the  maximum  likelihood  estimates  of  0  and  r2  are  given  by  the  ordinary  linear  model 
estimates 

0  =  (XTX)~1XTy  (2.4) 

r2  =  ±(y-X0)J(y-X0).  (2.5) 

Estimated  variances  and  covariances  of  the  parameter  estimates  are  given  by 

Var(~0)  =  2  f2(XJX)~1.  (2.6) 

Extensions  to  include  linear  covariates  axe  straightforward.  Covariate  information  can 
be  appended  as  columns  to  the  design  matrix  X,  and  the  formulas  in  (2.4),  (2.5),  and 
(2.6)  still  hold,  as  long  as  X  remains  nonsingular. 

2.1.2  Unrestricted  Variances 

We  now  explore  a  model  where  team  scores  are  not  necessarily  homoskedastic.  Let 

S.**~N(0lt,r2), 
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The  likelihood  for  this  model  can  be  written  as 
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where  L0^s  denotes  the  likelihood  given  the  observed  data. 

Maximum  likelihood  estimates  can  be  found  using  the  EM  algorithm  (Dempster,  Laird 
and  Rubin  1977).  The  Newton-Raphson  algorithm  may  also  be  used,  but  the  ease  of 
implementation  of  the  EM  algorithm  encourages  its  use  in  this  case.  We  now  briefly 
describe  the  procedure  for  obtaining  maximum  likelihood  estimates  using  EM. 


We  note  that  the  observed  data  are  the  score  differences,  j tik3kk,  and  that  the  score  of 
team  ik  from  game  k,  which  we  denote  can  be  thought  of  as  missing  data.  Therefore 
the  result  of  the  fc-th  comparison  is  a  score  stkk  for  team  *'*,  and  a  score  atkk  -  y,kjkk  for 
team  jk,  with  S{kk  unobserved.  To  carry  out  the  M-step  of  the  EM  algorithm,  notice  that 
the  complete-data  likelihood  can  be  expressed  as  the  product  of  2n  independent  normal 
densities,  that  is, 

Lcom  oc  J]  |^-exp  -  0ik)2^j  |  •  |-i-exp  ^-~((s,ik  -  ytkjkk)  -  j  . 


The  maximum  likelihood  estimates  of  the  0,  and  the  rf  conditional  on  the  missing  data 
are  easy  to  compute.  The  maximum  likelihood  estimates  of  the  0,  are  the  sample  means  of 
the  complete  data  observations  that  correspond  to  0,  in  the  likelihood,  and  the  maximum 
likelihood  estimates  of  the  r2  are  the  sample  variances  of  the  complete  data. 

The  E-step  of  the  EM  algorithm  requires  finding  expressions  for  the  expected  sufficient 
statistics,  which  in  this  case  is  equivalent  to  finding 


and 

We  can  derive 
by  first  obtaining 


E(4,*|0i . *i.  •••,*)• 

» •  •  •»’ri » •  •  • » y) 


Var(si|k*|0i,...,7f,...,y). 
These  expressions  can  be  shown  to  be  equal  to 


E(sitfc|0i, . . . ,  0P,  r2, . . . ,  r2,  y) 


#>JT?k  +  (*A  +  yikjkk)/rjk 

'K  +  i/ijL 
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Var(s,4k|0i , . . . ,  0P,  r,2, . . . ,  r2,  y) 
=>  E(s?4fc| 8u...,6p,T?,...,T%,y) 


1 

*/4  +  1/4 

1  ( e'JT?k  +  (g»  +  y>k,kk)/rfk  \ 2 

!/r2  +  1/4  +  V  1/4  +  V4  ^  ' 


The  EM  algorithm  proceeds  as  follows.  Pick  a  set  of  p  starting  values  for  the  0,  and  for 
the  r,2.  Denote  these  and  At  the  t-th  iteration,  the  E-step 

requires 


E<‘+1>(*»*) 

E«WH») 


#W/r? 

1  tj 


+(«s:,+k.»*)/4"> 


»* 


+1/4 


2(0 


l/4(t)  +  l/r2(l) 


+  (E<‘+1W))2 


which  are  the  expected  values  and  expected  squared  values  of  the  missing  data  given 
the  parameter  estimates  after  the  f-th  iteration.  The  M-step  then  replaces  the  current 
parameter  estimates  with  the  maximum  likelihood  estimates  from  the  “complete”  data: 


0,-t+1*  <—  Sample  mean  of  expected  data  involving  team  * 

T?  *£+1^  <-  Sample  mean  of  expected  squared  data  involving  team  i,  minus  (01-,+1*)2. 

In  the  M-step,  we  choose  to  force  the  0,-  to  have  mean  0  by  subtracting  off  the  mean  of  the 
8i  at  every  iteration.  This  guarantees  that  the  8,  will  be  centered  around  0.  Otherwise,  the 
EM  algorithm  will  produce  estimates  that  may  be  translated  from  the  estimates  obtained 
by  centering  around  0,  and  the  translation  will  depend  on  the  starting  values. 


This  iteration  is  continued  until  convergence.  Depending  on  the  size  of  the  problem, 
the  algorithm  may  converge  slowly  because  the  problem  assumes  that  one-half  of  the 
complete  data  is  missing.  Asymptotic  variances  and  covariances  of  the  the  parameters 
may  be  found  using  SEM  (Meng  and  Rubin  1991)  in  conjunction  with  the  EM  algorithm. 
An  alternative  would  be  to  use  gradient  methods  which  may  converge  more  quickly,  but 
require  specification  of  the  Hessian  matrix  which  may  be  cumbersome. 


2.2  A  Bayesian  Formulation 


A  complete  Bayesian  formulation  of  the  normal  model  incorporates  prior  distributions  on 
team  abilities,  as  well  as  on  r2.  For  the  following  discussion,  we  now  let  r2  be  the  variance 
of  the  score  difference,  and  let  4>  =  l/f2  be  the  precision  of  score  differences.  We  assume 
a  normal-gamma  prior  distribution  on  ( 0 ,  <f>)  with  density 

/(»,*)  «  ^2exp{-|(»-M)T«(®-M)}  -^-‘expt-i^O 
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which  we  write  as 

(**)~N  G(m,(,R,v). 

This  representation  is  used  by  Raiffa  and  Schlaifer  (1961,  Section  13.5).  The  normal- 
gamma  distribution,  besides  accommodating  a  wide  range  of  prior  beliefs,  is  the  conjugate 
distribution  for  normal  regression  models.  The  distribution  acquires  its  name  from  the 
factorization 

<t>  ~  Gamma(tf/2,uf/2) 

(0\<t>)  ~  N(m,(W‘). 

In  the  normal-gamma  distribution,  the  parameter  ft  is  the  mean  of  0,  the  matrix  R  is  the 
information  matrix  for  0  in  units  of  <t>,  the  parameter  f  is  the  harmonic  mean  of  r2,  and 
v  is  equivalent  to  the  number  of  degrees  of  freedom  associated  with  4>. 

Suppose  we  observe  n  score  differences,  y,  along  with  the  n  X  p  design  matrix  X.  We 
assume  as  in  *2.3),  suppressing  the  conditioning  on  X , 

(y\6,4>)~X(X0,l/<t>). 

Then  following  Raiffa  and  Schlaifer  (1961),  the  posterior  distribution  for  (0,  <j>)  is 

(0,<t>\y)~KG(fi\{',R,v') 

where 

R'  =  R  +  XrX 
y!  =  R'-1(Rp  +  X7y) 
v'  =  v  +  n 

t  =  +  yTRy)  +  yJy  -  (2.7) 

Raiffa  and  Schlaifer  (1961)  also  show  that  the  posterior  distribution  of  0  marginalized 
over  <t>  is  a  multivariate  t-distribution  on  v'  degrees  of  freedom, 

f(0\y)  <x  \R\V\v'  +  (0-  ti’)JR(0  -  n')/0(v'+p)/2 

with  mean  vector  n'  and  scale  matrix  £'R' _1.  Inference  on  0  can  be  obtained  from  the 
multivariate  t  marginal  posterior. 

Typically,  information  about  teams’  abilities  is  available  prior  to  competition,  and  this 
information  can  be  incorporated  into  the  prior  distribution  of  the  parameters.  If  no  prior 
information  exists,  a  multivariate  normal  prior  with  large  variances  can  reflect  this  lack 
of  information,  and  the  analysis  proceeds  without  difficulty.  Note  that  in  the  Bayesian 
model  it  is  not  necessary  to  add  a  constraint  on  the  Oi  in  order  to  obtain  a  unique  posterior 
mean  y! .  However,  it  is  necessary  to  require  the  prior  parameter  R  to  be  of  full  rank  so 
that  R  is  invertible. 

The  inclusion  of  linear  covariate  information  is  accompanied  by  a  trivial  extension  to 
the  Bayesian  model.  We  place  a  normal-gamma  prior  on  the  regression  parameters  as 
before,  and  (2.7)  again  describes  the  posterior  distribution. 
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2.3  A  Dynamic  Model 


In  many  paired  comparison  situations,  teams  will  compete  against  each  other  over  time, 
and  during  this  time  it  is  reasonable  to  believe  that  team  abilities  may  change.  For 
example,  between  seasons  of  professional  sports,  team  abilities  may  change  -  amatically 
as  a  result  of  player  trades,  training,  changes  in  coaching  staff,  and  so  on.  Vve  want  to 
reflect  this  possibility  in  modeling  the  paired  comparison  data. 

The  Bayesian  model  of  the  previous  section  is  not  appropriate  for  paired  comparisons 
where  team  abilities  may  change  over  time.  It  would  not  be  reasonable,  for  example,  to 
continually  update  the  posterior  distribution  using  (2.7)  as  new  data  is  observed,  because 
such  an  analysis  treats  the  observations  as  time-exchangeable.  Instead,  an  appropriate 
model  should  give  greater  importance  to  more  recent  results. 

We  define  a  tournament  (or  season)  as  a  collection  of  paired  comparisons  that  are 
made  simultaneously,  or  close  enough  in  time  to  treat  the  observations  as  exchangeable. 
For  simplicity,  assume  that  occurrences  of  tournaments  are  separated  by  equal  amounts 
of  time.  We  use  superscript  ( t )  to  index  tournaments,  t  =  1,2, ....T.  Let  be  the 
vector  of  score  differences  resulting  from  tournament  t,  and  let  be  the  design 
matrix,  or  tournament  schedule,  associated  with  tournament  t,  as  defined  in  Section  2.1. 
Also,  let  0^  be  the  vector  of  team  parameters  associated  with  tournament  t.  The  model 
for  observations  is 

(y(1)| r2)  ~  N( r2In„> ).  (2.8) 

Notice  that  r2  does  not  depend  on  time.  We  now  introduce  a  probability  model  that 
describes  the  evolution  of  the  parameter  0 , 

0<‘)  =  00-*)  +  j/(‘)?  (2.9) 

where  i/W  is  the  amount  by  which  team  abilities  change  between  tournaments  t  —  1  and 
t.  The  parameter  i/W  is  assumed  to  be  stochastically  independent  of  We  further 

assume 

(i/(tV2)  ~  N(at,<r2Ip),  (2.10) 

where  the  vector  parameter  at  is  the  mean  amount  by  which  teams  change  between 
tournaments  t  —  1  and  t,  and  <x2  is  the  between-season  variance  for  a  team’s  ability.  The 
parameter  at  may  be  a  function  of  covariate  information. 

We  specify  an  initial  prior  distribution 

(0(1K<t>)  ~  NG(ji{1),^l^,ff(1*,v(|)),  (2.11) 

where  /if1),  ^f1),  JZf^and  rf1)  are  the  hyperparameters  for  the  prior  distribution.  A  vague 
prior  on  flf1)  is  obtained  by  choosing  parameters  R.W  and  such  that  the  diagonal 
.  elements  of  Var(©fa))  =.  ^^(.Rf1))-^  are  large.  A  prior  mean  of  <f>  is  set  by  choosing 

an  appropriate  value  of  {W,  andTne  uncertainty  in  <f>  is  captured  by  the  value  of 
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Lower  values  of  connote  less  certainty  about  <$>.  The  off-diagonal  elements  of  J?* 1 1 
describe  the  prior  correlation  believed  to  exist  between  the  ratings  of  teams.  If  no  known 
similarities  between  teams  exist  a  priori,  we  set  the  covariances  of  teams  ratings,  and 
therefore  the  off-diagonal  elements  of  to  0. 

We  also  assume  a  prior  distribution  onu  =  l/o2  having  density 

f(u)  <x  ^“V60",  (2.12) 

which,  for  ao  >  0,  bo  >  0,  gives  the  Gamma  density.  Proper  inferences  are  possible  with 
other  values  of  ao  and  60>  however,  and  even  an  improper  prior  distribution  with  60  =  0 
may  be  a  sensible  choice.  Without  much  prior  knowledge  concerning  w,  we  may  choose 
ao  small  to  reflect  the  initial  uncertainty  of  u.  For  proper  prior  distributions,  the  value  of 
b0  can  be  chosen  so  that  b^/ao  is  an  initial  guess  (the  prior  harmonic  mean)  of  a2 . 

West  and  Harrison  (1990)  define  a  normal  dynamic  linear  model  as  consisting  of  the 
sequence  of  quadruples  H  =  {F^,  G^,  VfO,  wM*)}  where 

(y(*)|*(*>,ir)  ~  n(f<‘W‘>,f(T)), 

(0(‘)|0(‘-D,.ff)  ~  N(GW*“1,,WrW). 

The  model  defined  in  (2.8)-(2.12)  is  an  example  of  a  dynamic  linear  model  with  F(t)  = 
X^\  GW  =  Ip,  and  WM  —  a2 Ip.  Note  that  in  our  model  r2  is  unknown, 

so  we  impose  a  joint  prior  on  1/r2)  to  account  for  our  uncertainty. 

Equation  (2.8)  is  the  so-called  “observation  equation”  of  the  dynamic  model,  and 
defines  the  sampling  distribution  of  the  score  differences.  Equation  (2.9),  along  with 
the  distributional  specification  of  (2.10)  is  often  called  the  “system  equation”  or  “state 
equation”  of  the  dynamic  model.  The  assumption  underlying  these  equations  is  that 
team’s  abilities  change  on  average  by  at  from  tournament  t  —  1  to  t.  The  parameter  at 
may  be  known  a  priori,  or  it  may  involve  estimable  parameters.  We  assume  ct<  may  depend 
on  characteristics  of  the  teams  that  relate  to  the  evolution  of  their  abilities  over  time,  such 
as  age  of  the  r.'iyers  on  the  team.  However,  we  assume  for  purposes  of  developing  our 
model  that  at  —  0  for  all  t,  meaning  that  teams’  abilities  are  expected  to  remain  the  same 
relative  to  other  teams.  This  assumption  defines  a  random  walk  model  for  the  parameter 
0. 


2.4  Analysis  with  cr2  known 


This  section  illustrates  the  analysis  of  the  model  v  hen  <r2  is  assumed  to  be  known.  It  is 
often  unrealistic  to  make  this  assumption  in  p'actice,  although  much  of  the  literature  on 
dynamic  models  treats  cr2  as  either  known  (see,  for  example,  Meinhold  and  Singpurwalla 
1983),  or  well  estimated,  ^  :b~  the  analysis  of  the  model  can  be  performed  conditional 
on  the  estimate.  We  develop  here  the  methodology  for  analyzing  data  under  the  dynamic 
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model  with  Known  <r2,  and  apply  these  methods  in  Section  2.5  where  we  assnme  a2  is  an 
unknown  parameter. 

Let 

dt  —  {Observed  data  for  season  t} 

Dt  =  (di,di,...,dt) 

—  {Observed  data  through  season  t) 

By  convention,  we  let  Do  be  the  state  of  the  dynamic  system  before  data  is  observed. 

For  this  section,  we  are  primarily  interested  in  obtaining  the  distribution  of 
the  posterior  distribution  of  0 ^  marginalized  over  all  other  parameters,  and  (dj+ilDx), 
the  forecast  distribution  of  score  differences.  The  posterior  distribution  (0^t^\Dt)  may  be 
of  interest  in  ranking  teams  after  a  season,  making  use  of  previous  seasons’  results.  The 
forecast  distribution  for  next  season’s  results  may  be  used  to  find 

Pr(j/fJ+1^  >  0|I?t)  =  Pr(i  defeats  j  in  season  (T  +  1)  given  available  data). 

In  this  section,  we  obtain  these  results  conditional  on  a2. 

The  full  posterior  distribution  given  o 2  can  be  written  as 

/(©W  ...,0<T>,#r2,.Dr) 

«  {/(*k2,  Do)}  x  {/(©(1>|*,  <r2,  Do)} 

x{/(«M*1U,*2,A>)} 
x  ft  {{/(*(0I*(‘"1) . ©^W2,  A- a)} 

t=2 

X  {/(d«|©W  ...,©W,*,<72,A-l)}} 
or  equivalently  under  the  probability  model  as 

/(0^\...,^T\<f,\<r2,DT) 

<x  {exp(-u<1>e(1)^/2)^1)/2-1} 

x{exp(-^(©<1)  -  m(1))T«(1)(©(1)  -  M(1)))} 

x{^n(1)/2exp(-|(y(,)  -  -  X(1>0{1)))} 

x  ft  {{exP(-2^(®(<)  ~  «(t-1))T(e(0  ~  «(<-,)))} 

x  {<^n<‘)/2exp(-^(yW  -  X(‘>©W)T(y<‘>  -  X^©W))}J  (2.13) 

We  are  primarily  interested  in  obtaining  the  distribution  of  (0^T\  <f>\ cr2,  Dt),  so  we  inte¬ 
grate  the  full  posterior  distribution  in  (2.13)  over  the  parameters  6(1\ . .  This 
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is  accomplished  through  a  recursive  updating  and  forecasting  operation.  An  updat¬ 
ing  operation  is  the  Bayesian  calculation  that  obtains  the  distribution  of  ,<p\<r2,  Dt) 
from  a2,Dt-\)  after  acquiring  data  at  time  t.  A  forecasting  operation  is  the 

Bayesian  calculation  that  obtains  the  distribution  of  (9^t+2\ 4>\r2 ,  Dt)  from  the  distri¬ 
bution  (0^,d>|<r2,Ht),  reflecting  the  additional  uncertainty  due  to  the  passage  of  time. 
Repeated  application  of  updating  and  forecasting  operations  results  in  the  distribution  of 
(0^T\  4>\<t2 ,  Dt)  from  that  of  \a2.  Do)- 


2.4.1  Updating  and  Forecasting  Recursions 

We  demonstrate  the  updating  and  forecasting  recursions  by  considering  the  state  of  the 
dynamic  system  prior  to  tournament  t.  The  prior  distribution  on  <}>\cr2,  Dt-\ )  is 

(0(<),  *l<r 2,  Dt. i)  ~  NG(M{,),e(t),  R{t),  v <*>) 

as  in  (2.11).  We  now  observe  dt,  consisting  of  the  nW  outcomes  of  paired  comparisons  of 
tournament  t,  and  by  the  methodology  of  Section  2.2  we  obtain 

(9^,<i>\a2,Dt)  ~  (2.14) 

where 

&{t)  =  JlW  +  x(t)TX(t) 

fi'W  =  R'(‘)  +  X(‘)Ty(t)) 

*'<*)  =  /)  +  nW 

£’<*>  =  ^y[(v(1)^(t)  +  M(°Ti2(<V(t))  +  y(t)Jy(t)  ~  /i'(<)TR'(‘V(<)]  (2.15) 

This  completes  an  updating  calculation. 

To  demonstrate  the  forecasting  calculation,  we  assume  the  distribution  in  (2.14)  has 
been  obtained.  The  conditional  distribution  of  is  specified  as 

(0(‘+1)| *(‘U,<r2,A)  ~  N(0<‘>,<7%), 

which  can  be  interpreted  as  incorporating  the  additional  uncertainty  in  the  team’s  abilities 
at  time  t  +  1.  The  joint  distribution  of  parameters  before  observing  paired  comparison 
data  at  time  t  +  1  is 

f(9<t+1\ 4>\o2,  Dt)  =  /(««  4>\o2,  Dt)  x  /(0<t+1>|0<‘>,  <f>,  o2,  Dt) 

a  exp{_^(0  _  _  M'(‘))^/ 2 

x  exp(— 

X  exp(--^(0(<+1)  -  0<<>)T(0(H'1)  -  0(‘>)). 

2(7* 
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Marginalizing  over  0^  yields 

f{0^\4>\a\Dt)  =  j  f{0^\0^A\a\Dt)d^ 


where 


«  exp{-|(0<‘+1>  -  M<‘+D)TJi<‘+i)(0(‘+i)  _  m(«+D)}<^/2 

x  exp(-i^(‘+1>e(‘+1))^,+1>/2-1, 


R(t+i)  =  («y(,)"1  +  ^)iPr1 
M(‘+ 1)  =  M'(‘> 

„(t+i)  _  v>(t) 


£(t+ 1)  _  ^(0. 

This  marginalization  gives  the  distribution  of  (0*t+1),  <f>)  given  Dt,  that  is, 
{0^x\4>\a2,Dt)  ~  NG(M(*+1),f(‘+1),H(‘+1),t;('+1)). 


This  defines  a  single  iteration  of  the  forecasting  calculation.  Note  that  the  only  difference 
between  the  distributions  of  (0^,  ^|<r2, Dt)  and  (0*<+1\ d>|o2, Dt)  is  that  the  conditional 
variance  of  the  rating  parameters  in  the  latter  distribution  has  been  increased  by  o2Ip. 


2.4.2  Predictive  distribution  of  (dt+i\<r2 , Dt) 

The  forecast  distribution  (dt+i\cr2,Dt),  t  =  0,...,7  or  the  observations  to  be 

observed  during  season  t  +  1  can  be  obtained  in  the  following  manner.  We  have 

<P 

{0(t+V\<j>,a2,Dt)  ~  N(/i(1+1),(<>iJ(t+1))_1) 

(<p\<r2,Dt)  ~  Gamma(n<t+1V2.  »(,+1)e(1+1)/2). 

The  conditional  distribution  of  dt+1  given  <f>,  but  marginal  over  is  normal,  and  has 

mean  and  variance  given  by 

E(dt+i  |  X(t+1),  <(>,  <r2,  Dt) 

=  E(E(d1+i|0*t+1\  <j>,  <r2 ,  Dt)) 

=  E(X<t+1>0<‘+1  V‘+1),  <t>,  Dt)  =  X(f+1  V‘+1) 


Var(d«+1 |X*<+1\ /z(t+l),  *(t+1),  <t> ,  <r2,  Dt) 

=  E(Var(dt+1|0<‘+1\  R)t+l\<f>,a2,Dt))  +  Var(E(d<+1|0(‘+1),il<<+1),^tr2,A)) 
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=  E(ilB(,+l))  +  Var(X<‘+1>0<‘+1>) 

9 

=  i(In(.+l)  +  X(t+l)R(t +1)  "1X(‘+1)T). 

9 

This  implies 

(dt+l,4>\<r2,Dt)  ~  NG(X<‘+V‘+1)>*(‘+1),I„(.+i)  +  X<‘+1>J2<‘+1>  -1X<‘+1>T,t;<t+1>), 

from  which  we  obtain  the  standard  result  that  the  distribution  marginal  over  <t>  is  given 
by  the  i-distribution 

(dt+ak2,A)  ~  Twt,+1)(X(‘+1)M(‘+1),^‘+l)( I„(.+i)  +  X<‘+1>.R<‘+1>  -1X<‘+1>T)).  (2.16) 
The  univariate  predictive  distribution  for  y|j+1^  is 

(yg+V  ,A)  ~  T^Ut+1) -MJ‘+1),|<‘+1)(1  +(J*<‘+1>)«1  +(JB<‘+l))->  -2(Ji“+l>)-1)). 

From  this  distribution,  it  is  straightforward  to  calculate  Pr(yj;<+1^  >  0|<r2,  Dt)  numerically. 


2.5  Analysis  with  a2  unknown 


In  most  paired  comparison  experiments  the  extra  variance,  a2,  describing  the  magnitude 
of  changes  in  the  rating  parameters  between  successive  competitions  is  unknown,  and 
should  realistically  be  treated  as  an  unknown  parameter  in  the  analysis  of  the  model. 
Furthermore,  it  is  often  of  direct  interest  to  learn  from  the  data  what  are  plausible  values 
of  a2 .  In  this  section,  we  explore  two  approaches  to  analyzing  the  model  with  unknown 


To  motivate  our  two  approaches,  suppose  we  assume  the  dynamic  model  of  (2.8)- 
(2.12).  We  observe  T  tournaments,  and  are  interested  in  the  marginal  posterior  distri¬ 
bution  {9^T\<f>\Dj).  To  obtain  this  distribution,  we  can  marginalize  over  the  conditional 
distribution  given  w  =  l/<x2, 

/(^UjDr)  =  J  f{6P\<t>\u>,DT)  f(u\DT)dw. 

The  first  density  in  the  integrand  is  normal-gamma,  and  is  calculated  by  the  updating 
and  forecasting  recursions  for  known  a2  of  Section  2.4.  The  second  density  is  difficult  to 
obtain  in  closed  form.  We  propose,  therefore,  two  approaches  to  the  analysis.  One  ap¬ 
proach  involves  drawing  a  large  random  sample  from  the  marginal  distribution  of  (w| Dj) 
and  uses  the  sample  as  an  approximation  to  the  exact  distribution.  This  is  accomplished 
using  the  Gibbs  sampler  as  described  in  Section  2.5.1.  Alternatively,  we  can  approximate 
the  prior  distribution  of  «  by  a  discrete  distribution,  so  that  the  posterior  probability 
distribution  of  (u\Dj-)  is  easily  calculated.  This  approach,  a  non-iterative  analysis,  is  de¬ 
scribed  in  Section  2.5.2.  In  either  case,  to  compute  the  marginal  posterior  distribution  of 
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{6^tK^\Dt),  the  first  conditional  density  in  the  integrand  is  integrated  over  the  approxi¬ 
mate  distribution  of  (u>|.Dx),  and  the  computation  of  marginal  posterior  distributions  of 
or  dx+i  becomes  a  more  tractable  problem.  Section  2.7  discusses  how  to  obtain  these 
marginal  posterior  distributions. 


2.5.1  Analysis  using  Iterative  Simulation 

The  Gibbs  sampler,  an  iterative  simulation  technique  for  drawing  samples  and  summariz¬ 
ing  parameters  of  distributions  that  would  otherwise  be  difficult  to  examine,  is  a  compu¬ 
tational  approach  that  is  being  used  increasingly  in  the  analysis  of  Bayesian  models.  The 
algorithm  was  first  described  in  detail  in  Geman  and  Geman  (1984),  though  many  of  the 
ideas  underlying  the  Gibbs  sampler  have  been  developed  in  previous  work  (Besag  1974; 
Hastings  1970;  Metropolis  et  al.  1953).  More  recently,  Gelfand  and  Smith  (1990)  develop 
the  Gibbs  sampler  methodology  in  a  general  framework  and  discuss  the  properties  of  the 
algorithm.  The  flexibility  of  the  iterative  simulation  approach  has  enabled  the  analysis  of 
Bayesian  models  in  which  closed  form  posterior  distributions  are  difficult  or  impossible  to 
obtain. 

To  describe  the  use  of  the  Gibbs  sampler,  suppose  we  have  m  random  variables 
Zi,...,Zm,  and  we  are  able  to  specify  and  generate  random  samples  from  the  condi¬ 
tional  distributions  (Zi|Zj, .  ..,Zi-\,Zi+ 1, . .  .,Zm),  i  =  1,. .  .,m,  that  is,  the  distribution 
of  each  of  the  Z,  conditional  on  the  rest.  The  Gibbs  sampler  begins  with  an  arbitrary  set 
of  starting  values  z[°\ .  ..,  Z«?\  Then  at  iteration  q,  we  draw 

z\q)  ~  /(z,i4*-l) . z<rl}) 

z^  ~  /(z2|zj,),z<?-1\...,z<r1)) 

&  ~  /(zm|zi,),...,zL’L1). 

Under  regularity  condition?  specified  in  Geman  and  Geman  (1984),  the  distribution  of 
(Zj*\ .  ,.,Zm)  converges  to  the  joint  distribution  of  (Z*, . . . ,  Zm)  as  l  — *  oo. 

Assessing  convergence  of  the  algorithm  is  still  under  investigation  (see,  for  example, 
Gelman  and  Rubin  1992a;  Geyer  1992).  The  approach  used  throughout  this  paper,  pro¬ 
posed  by  Gelman  and  Rubin  (1992a),  involves  running  multiple  Gibbs  samplers  with 
overdispersed  starting  values,  and  monitors  convergence  by  comparing  the  within-sample 
variance  of  single  samplers  to  the  between- sample  variance.  Other  approaches  to  moni¬ 
toring  convergence  are  mentioned  in  Gelfand  et  al.  (1990)  and  Zeger  and  Karim  (1991). 

We  use  the  Gibbs  sampler  in  the  analysis  of  the  dynamic  paired  comparison  model  to 
approximate  the  marginal  posterior  distribution  of  (w| Dj)-  This  is  done  by  alternately 
sampling  from  the  conditional  posterior  distribution  of  Dr)  and  the 

conditional  posterior  distribution  of  (w| 0^\ . .  Dt)-  Liu  (1992)  notes  that  the 
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Gibbs  sampler  will  converge  more  quickly  if  parameters  can  be  combined,  as  in  this  case, 
so  that  rather  than  alternating  among  a  total  of  T  +  2  conditional  distributions,  we  only 
alternate  between  two  conditional  distributions.  After  the  Gibbs  sampler  converges,  we 
continue  the  iterative  simulation  to  obtain  a  random  sample  from  the  marginal  posterior 
distribution  of  (ui\Dx)-  We  now  describe  the  methods  used  to  draw  samples  from  each  of 
the  conditional  distributions. 

Steps  of  the  Gibbs  Sampler 

The  Gibbs  sampler,  which  we  describe  in  detail  below,  proceeds  as  follows  (the  subscript 
c  is  used  to  indicate  parameters  part  of  the  current  Gibbs  draw): 

1.  Pick  a  starting  value,  ue,  for  the  innovation  precision. 

2.  Obtain  the  normal-gamma  distribution  for  (0^T\  <j>\ue,  Dx)  via  the  updating  and 
forecasting  operations  of  Section  2.4. 

3.  Draw  4>c  from 

(<t>\uc,Dr)  ~  Gamma(t^Tty2,  v'^T^T^ /2). 

4.  Draw  0^)  fr0m 

(0^\4>c,Uc,DT)  ~  N 

5.  For  t  —  T  -  1, . . .,  1,  successively  draw  0^  from 

(0(‘>| 0l*+V,...,0p,4>e,u>c,DT)  ~  NfV^J^V*0  +  Wc0(m)),V<“), 

where  VW  =  ( 4>R' W  +  wcIp)  1.  This  completes  the  first  half  of  a  Gibbs  sampler 
iteration,  that  is,  a  draw  from  the  distribution  of  (0^1, . . . ,  0^T\  ^|w,  Dx). 

6.  Draw  uc  from 

1  T~l 

(«l 0?\...,O(Pac,Dt)  ~  Gamma(ao+p(7’-l)/2,6o+“ 

z  <=i 

This  completes  the  second  half  of  a  Gibbs  sampler  iteration. 


Repeat  steps  (2)  through  (6)  until  convergence. 

Sampling  from  (0(1), . . . ,  0(T\  <f>\ |u>,  Dj ) 

In  Section  2.4,  we  describe  how  to  obtain  the  posterior  distribution  of  Dr) 

by  the  updating  and  forecasting  recursions  (steps  (2)-(4)).  Now  we  are  interested  in  the 
posterior  distribution  of  the  remaining  parameters  conditional  on  u>. 

Let  u>c  be  the  current  draw  of  u  in  the  Gibbs  sampler.  To  draw  from  the  joint  distribu¬ 
tion  of  (0*‘\ . .  .,0(T\<p\tjJc,Dx),  we  invoke  the  following  argument.  Suppose  we  want  to 
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draw  from  the  joint  distribution  of  random  variables  ( Aj,  A'j, ....  A'n).  To  do  so,  we  can 
first  draw  from  the  marginal  distribution  of  A'n,  and  then  draw  from  the  conditional  distri¬ 
bution  of  (A„_i|A„),  (A'n_2|An_i, A„),  and  so  on,  until  we  draw  from  (Aj|A'2 . A'n). 

This  process  gives  a  single  draw  from  the  joint  distribution  of  all  variables. 

For  the  current  problem,  we  can  rewrite  the  joint  posterior  of  ( 0(1K  . . . ,  0(T>.  c>|-.'c,  Dr) 
as  a  product  of  conditionally  independent  densities, 

/(0{1>, . . . , 0(TK  4>)uc,  Dt)  =  /(*(TU |wc,  Dt)  X  /(0(T*1>|0<T>,  we,  Dr) 

x  •  •  ■  x  /(•*») |*a>, . . ., 0<T), 0,we,  r>r).  (2.17) 

This  factorization  suggests  that  to  draw  from  the  joint  distribution  of  T  +  1  parameters 
on  the  left-hand  side  of  (2.  .7),  we  can  successively  draw  from  each  of  the  conditional 
distributions  on  the  right-hand  side.  Given  a  particular  value,  u>c,  of  the  system  precision, 
we  can  compute  the  distribution  of  (&T\<f>\ujc,  Dt)  using  the  updating  methodology  of 
Section  2.4. 

To  understand  how  to  draw  from  the  distribution  of  . . .  ,0^K  d>c.u>c,  Dt), 

t  =  1, . . .  ,T— 1,  it  is  helpful  to  write  down  the  joint  posterior  distribution  of  all  parameters. 
Conditioning  now  on  ljc,  we  have 

f(0^K...,^T\^e,DT) 

«  {/(#*,  A>)} 
x{/(0<‘)|*,u,c,A>)} 
x{/(di|0(,U,u>c,A>)} 

x  n  {</(*  v*-1*,  •  •  - .  < ,  a-i)} 

1=2 

X  {/(dt |*«> . *<*>,*,<*, 

which  we  can  write  as 

f(&'\...,#T\<t>\uc,DT) 

oc  {exp(-t;(1>e<1W2)^w<,)/2"1} 

x{exp(-|(0(1)  -  #»i,-,)TJl(I>(0<,)  -  m(1)))} 

x{<An(,)/2exp(-|(y<1)  -  X<1>0<1))T(V<1)  -  JfOty1)))} 

x  I]  (  {exp("(0(O  ~  •<f”,))T(«W  - 

t=2  ^  ^ 

x  {^(,)/2exp(-^(y<‘)  -  JT«0«)T(*(f>  -  X(t)0(‘>))}  j  . 

We  now  compute  the  joint  posterior  distribution  marginalized  over  . . . ,  This 

is  accomplished  by  performing  the  updating  and  forecasting  computations  of  Section  2.4 
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to  obtain  the  marginal  distribution  of  (0^,d>ju>c,  Dt).  This  results  in 

f(OW,...,0lTK<l>\ue,DT) 

cx  /(0<‘U|u,c,  £>,)  x  f(0lt+l\...,OlT'\0lt\<p,u;c,DT) 

oc  {exP(-v,^'(t]4>l2)4>v'(,>/2-1} 

x{exp(-|(e(l)  -  m'<‘))tij'(‘)(0<‘>  -  M'<‘>))} 
x{0n(‘+1)/2exp(-^(y(l+1)  -  A:(‘+1)«<t+1))T(y<‘+1>  -  X<t+1^(‘+1)))} 
x{exp (-^(0<‘+2>  -  *<‘+1>)T(*(‘+2>  -  *(t+1)))} 

x{exp(_|L(0(T)  _  d(T-i))T (0{T)  _  0(T-i)))} 

x{^<r,/2exp(_|(j,(T)  _  XWeW)T(yW  -  X<TWT>))}.  (2.18) 

The  conditional  distribution  of  ,  0^T\  <t>,  uc,  Dt)  is  proportional  to  the  marginal 

posterior  distribution  in  (2.18)  treating  the  parameters  0*‘+1), . .  as  constants. 

This  is  simplified  by  neglecting  the  terms  that  are  constant  with  respect  to  0(t\  which 
yields 


oc  {exp(-|(0W  -  m,(‘>)t.R,<*>(0(‘)  _  M'(0))} 
x{exp(-y(0<‘+1>  -  0l‘>)T(0<‘+1)  -  »<*>))}. 

Thus,  conditional  on  0^t>  is  independent  of  the  data  dt+i,. .  .,dx.  We  therefore 

have 

(0<‘)|0<t+i)? . . 0<T>, <f>, uc,  Dt)  ~  N(V<t)(d>jr'(t)fi'<t)  +  vc0{t+1)),  V<‘>),  (2.19) 

where  V W  =  (<j>R'W  +o;cIp)-1. 

So  to  perform  the  iterative  sampling  procedure  for  drawing  from  (0*1), . . . ,  0^r\4>\uc,  Dt), 
we  first  draw  0^  and  <j>c  given  we  and  Dt  as  in  steps  (3)  and  (4).  In  obtaining  the 
normal-gamma  parameters  for  (0^,^|we,  Dt),  we  retain  during  the  updating  and  fore¬ 
casting  operations  the  normal-gamma  parameters  for  {0^\  <t>\wc,  Dt),  t  =  1  These 

parameters  are  computed  in  the  intermediate  steps  as  in  Section  2.4.  Now  from  (2.19), 
we  can  draw  from  <pc,ujc,  Dt),  and  recursively,  we  can  draw  from 

(0<‘>|0(<+1\...,CUc  ,u>c,  Dt)-  This  process  gives  us  a  complete  single  draw  from  the 
conditional  distribution  of  (0^1\...,6^T\<f>\wc,DT),  which  completes  the  first  half  of  a 
single  Gibbs  sampler  iteration  (step  5). 

Sampling  from  (w)#^, . . . ,  6{/\  <t>c,  Dt) 
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To  perform  the  second  half  of  one  iteration  of  the  Gibbs  sampler,  assuming  we  have 
draws  0^, . . . ,  0^T\  <t>c,  we  now  want  to  draw  uc  from  the  conditional  distribution  of 
(w|0^, . . . , Dt)-  Suppressing  t -  conditioning  on  hyperparameters,  we  have 

0<TU,  u;  I  I>r) 
a  {u/,a°-1e-6ou'} 

x{exp(-u<1^1^/2) 

x{exp(-|(0(,)  -  m(I))TH<1)(*(1)  -  j^))} 
x{d>nll)/2exp(-  j(y(1)  -  X(1>0(1))T(y(1)  -  X(1)0(1>))} 

T 

x  fl  {{u>*'2exp(~|(0<*>  -  -  0<‘-1>))} 

x  {^n<')/2exp(-|(y(‘)  -  xWfl(‘>)T(yW  -  X(t)0(t)))}} 


The  conditional  posterior  density,  /(u;^1), . . . ,  6^\  <f>c,  Dt),  is  proportional  to  the  above 
density.  Ignoring  terms  that  are  constant  with  respect  to  u,  we  have 


/MflW  ...,0(TU,.Dt) 

oc 


x  n  {{u^2exp(-|(0<‘>  -  0<‘-»)T(0<‘>  -  **-»>))}} 


oc 


t=2 

{wO0"1e_i>0ta'} 


x{u*T-V/2exp{~  £(0(1+1>  -  0<‘>)T(0<t+1>  -  0<‘>))}, 

2  t=i 


so  that 


1  T_1 

M0(1),...,0^,<A,£>T)~Gamma(ao+p(r-l)/2,6o+^  £(0<<+1>-0<‘>)T(0<<+1>-0<‘))). 

2  t=i 

(2.20) 

Therefore,  to  complete  the  second  half  of  one  iteration  of  the  Gibbs  sampler  (step  6),  we 
randomly  draw  uc  from  a  Gamma  distribution  conditional  on  the  parameters  drawn  from 
the  first  half  of  the  Gibbs  sampler  iteration.  When  the  successive  0^  and  0<t+1'  are  close 
to  one  another,  the  sum  of  squared  deviations  in  (2.20)  will  be  small,  so  the  distribution  of 
u>  has  large  mean.  This  indicates  that  the  amount  of  variability  between  competitions  is 
small,  which  seems  consistent  with  the  successive  0^  and  0(t+1)  being  close.  Conversely, 
when  the  sum  in  (2.20)  is  large,  the  value  of  u  drawn  from  the  distribution  is  likely  to  be 
small,  corresponding  to  a  large  system  variance,  a2. 


Carlin,  Poison  and  Stoffer  (1992)  develop  a  methodology  for  modeling  a  large  class 
of  state-space  models  using  the  Gibbs  sampler  to  approximate  the  posterior  distribution 
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of  parameters.  Their  approach,  which  can  be  used  to  analyze  the  normal  dynamic  linear 
models  of  the  form  (2.8)-(2.11)  as  a  special  case,  involves  constructing  a  Gibbs  sampler 
that  alternately  draws  from  many  conditional  distributions.  However,  as  demonstrated 
in  Liu  (1992),  if  some  parameters  can  be  “grouped”  into  joint  distributions  (conditional 
on  the  rest),  then  the  resulting  Gibbs  sampler  is  a  more  efficient  procedure  in  the  sense 
of  faster  convergence  to  the  unconditional  posterior  distribution.  While  Carlin,  Poison 
and  Stoffer  (1992)  iterate  among  many  conditional  distributions,  our  approach  involves 
iterating  between  exactly  two,  so  that  our  Gibbs  sampler  will  converge  at  least  as  quickly. 


2.5.2  Analysis  using  a  Discretized  Prior  on  u 

An  alternative  approach  to  obtaining  the  marginal  posterior  distribution  of  0^  involves 
using  a  discrete  approximation  to  the  prior  distribution  of  u.  This  approximation  may 
be  done  in  any  of  several  reasonable  ways.  First,  we  may  draw  a  random  sample  of  size 
m  from  a  continuous  prior  distribution,  in  which  case  the  empirical  distribution  of  the 
sample  approximates  the  true  prior  distribution.  Another  possibility  is  to  partition  the 
range  of  u  into  m  intervals,  compute  the  probability  associated  with  each  interval,  and 
compute  the  centroid  of  each  interval.  A  third  approach,  which  seems  the  most  reasonable 
to  use  in  practice,  is  to  select  m  grid  values  of  u  that  are  overdispersed  for  the  distribution 
of  u>,  and  assign  probabilities  for  each  value  corresponding  to  prior  beliefs. 


Suppose  we  have  selected  a  set  of  to  values  of  u,  and  assume  we  have 

approximated  the  prior  distribution  by  the  probability  function 


f  *(0) 


if  u  =  uii 
\iu  i  {<*>!, 


•>um} 


where  we  assume  f(u\Do)  has  mass  only  on  the  set  {wlt . .  .,wm).  For  any  season  t,  we 
have  by  Bayes’  theorem 


f(u\Dt) 


M\Dt-i) 


/MA-i). 


The  marginal  distribution  of  u  given  data  through  season  t  has  a  simple  relationship  to 
the  distribution  of  u  given  data  through  season  t  —  1.  Continuing  the  recursion,  we  have 
for  all  t 


f(u\Dt)  = 


cx 


nUfWDj-i,*) 
n'=1  f(dj\Dj-i) 

Kurntmo^u). 

j=i 


Thus  the  marginal  density  of  {u\Dt)  is  proportional  to  the  product  of  the  prior  density 
of  (u\Do)  and  a  reweighting  factor,  which  is  the  product  of  conditional  likelihoods  of  the 
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form  The  reweighting  factor  for  a  specific  value  of  lj  is  a  measure  of  how 

likeJy  the  data  were  to  occur  given  the  previous  data  and  the  specific  value  of  u.  When  the 
reweighting  factor  is  small  for  a  given  value  of  u>  (relative  to  other  values  of  u),  the  data 
is  not  well  predicted  frorr.  the  given  value  of  u,  so  the  posterior  probability  of  that  specific 
value  of  u>  is  small.  From  Section  2.4.  (dt|I5t_i,w)  has  a  multivariate  t-distribution  with 
density 

f{dt\u,  Dt-x)  oc  +  (dt  -  ~\dt  -  AT(V‘))/*(‘)r<v(,)+B(,,)/2, 

with  CW  =  (7n(«)  +  XWrW  Therefore,  the  posterior  distribution  of  u  is  ob¬ 

tained  by  reweighting  the  prior  distribution  of  u;  by  a  product  of  multivariate  t-densities, 
so  that 

x(T)  = 

It  is  therefore  a  straightforward  procedure  to  approximate  the  posterior  distribv  on  of  ui. 

The  attractiveness  to  this  approach  over  the  iterative  simulation  approach  is  its  lower 
computational  cost.  In  performing  iterative  simulation,  let  N  denote  the  number  of 
iterations  performed  in  a  single  Gibbs  sampler,  and  let  g  be  the  number  of  parallel  Gibbs 
samplers  run  to  assess  convergence.  We  must  perform  Ng  forward  filters,  that  is,  the 
computation  that  produces  the  parameters  for  the  distribution  of  (&T\  <f>\u,  Dj)  from 
Dq).  In  addition,  we  must  run  Ng  back  filters,  that  is,  the  computation  that 
produces  the  distributions  of  Dt),  t  =  T  —  1,...,1.  In  the 

non-iterative  analysis,  we  only  need  to  run  m  forward  filters  and  no  back  filters.  If  we 
set  m  =  150,  g  =  6  and  N  =  500,  all  reasonable  parameter  values,  then  the  iterative 
approach  is  20  times  slower  than  the  non-iterative  simulation  approach  of  this  section, 
and  40  times  slower  if  back  filtering  is  considered  as  expensive  as  forward  filtering. 

The  main  disadvantage  of  the  non-iterative  approach  relative  to  the  Gibbs  sampler 
stems  from  choosing  the  initial  set  of  values  of  u.  If  the  data  show  that  the  support  of 
(uj\Dt)  is  not  well  represented  by  the  set  {wi, . .  .,wm}  so  that  the  prior  distribution  on  u 
is  misspecified,  the  posterior  distribution  of  [u}Dj)  will  not  be  approximated  accurately. 


2.6  Inclusion  of  Non-dynamic  Linear  Covariates 

Paired  comparison  experiments  often  include  data  which  might  be  usefm  predictors  of  the 
outcome  of  future  comparisons.  The  model  stipulated  in  Section  2.3  can  be  extended  to 
include  linear  covariates  whose  distribution  does  not  change  due  to  the  passage  of  time. 
In  this  section,  we  describe  a  model  that  includes  linear  non-dynamic  covariates  into  the 
dynamic  model,  and  show  how  to  modify  the  analyses  of  Sections  2.4  and  2.5  in  the 
pre  °nce  of  these  covariates. 
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Suppose  we  observe  with  each  paired  comparison  a  set  of  q  variables.  Let  to 
be  the  n M  x  q  covariate  data  matrix,  and  let  /?  be  the  vector  of  q  parameters  associated 
with  the  variables  where  ft  is  assumed  to  remain  constant  over  time.  The  model  for  score 
differences  at  time  t  is  now  given  by 

Vw  ~  N(X?W.‘> ,k,(o), 

9 

where  X is  the  n  x  (p-fg)  concatenation  of  and  WW,  and  6 is  the  (p  +  g)- vector 
(6^ ,P).  The  probability  model  that  describes  the  evolution  of  the  parameters,  6[f\  is 
given  by 

The  parameters  /?,  the  last  q  components  of  are  assumed  to  remain  constant  over 
time,  so  we  assume 

(t'i'V2)  ~  N(a.t,tr2JPf,), 

where  a.,  =  (at,0,0, ...,0),  that  is,  the  vector  at  followed  by  q  zeroes,  and  Jpq  is  the 
p  +  q  dimensional  square  matrix  consisting  of  zeroes  everywhere,  except  that  the  first  p 
diagonal  elements  are  equal  to  one.  This  formulation  of  the  model  specifies  that  the  first 
p  components  of  di*1  are  dynamic  and  are  allowed  to  change  over  time  according  to  the 
model  in  (2.9).  The  last  q  elements  of  remain  fixed  over  time,  and  as  more  data  are 
accumulated  these  parameters  are  estimated  with  greater  precision. 

Analysis  conditional  on  a2 

The  analysis  of  Section  2.4  where  a 2  is  known  remains  essentially  the  same  with  a  few 
modifications  in  the  updating  and  forecasting  formulas.  The  formulas  in  (2.15),  which 
update  the  normal-gamma  parameters  upon  observing  new  data,  do  not  change  with  the 
addition  of  non-dynamic  parameters, 

pd*)  _  p(4)  1  yd*)"1’  ydO 

^  =  **>-‘(iv>+x.  i‘)v>) 

£'«>  =  -^[(n^W  +  +  V(t)Ty(t)  -  M?‘)TJH(  V0]  (2.21) 

The  forecasting  equations  change  slightly  because  the  non-dynamic  parameters  do  not 
increase  in  variance  due  to  the  passage  of  time.  We  therefore  have 


Bit+1) 

£(*+i) 


/pd4)- 1  1  0  t  1 

'(<) 

v'W 

fd«). 


(2.22) 
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This  demonstrates  how  to  calculate  the  marginal  distribution  of  given  a7  through  an 
updating  and  forecasting  recursion. 

Gibbs  Sampler  analysis 


The  use  of  iterative  simulation  in  Section  2.5.1  undergoes  several  important  changes  in 
the  presence  of  non-dynamic  parameters.  The  goal  of  the  Gibbs  sampler  is  to  sample  alter¬ 
nately  between  the  distribution  of  and  (u>|0^, .  ..,0*T\ /?,</>,  Dy). 

The  difficulty  in  this  analysis  is  that  drawing  0^  is  affected  by  the  presence  of  the 
non-dynamic  parameter  0.  The  strategy  we  employ  is  to  factor  the  distribution  of 
(0^\ . . . ,  0^T\0,  4>\u>,  Dt)  into  a  product  of  conditional  densities, 

f(0,4>\u,DT)  X  . 0<T)| 0,d>,u,DT) 

To  perform  the  sampling,  we  obtain  the  normal-gamma  distribution  for  ( 6 —  {6^\  0),<f>\u>,  Dt) 
via  the  updating  and  forecasting  formulas  described  above.  We  next  sample  <f>c  bom  its 
gamma  distribution,  and  then  sample  0e  conditional  on  4>c  from  its  normal  distribution. 

To  sample  the  0^,  t  =  1, . . .,T,  we  need  to  recompute  the  conditional  normal  distribu¬ 
tions  of  (0^\0c,4>c,dt).  This  can  easily  be  shown  (see,  for  example,  Dillon  and  Goldstein 
1984,  pg.  546)  to  have  the  distribution 


(0(t)|  Pc,<t>c) 


N(#*g  +  Egsg'^g  -  &),  E'g  -  EgsJJ '^g). 


where 

E'(t)  =  . 

The  matrix  E'^  and  the  vector  /i'M  are  indexed  by  8  and  0  to  denote  the  respective 
subsets  corresponding  to  these  parameters.  Once  these  conditional  distributions  have 
been  obtained,  we  need  to  perform  a  recursion  analysis  similar  to  the  conjugate  normal- 
gamma  updating  and  forecasting  equations  of  Section  2.4.  This  analysis  is  condition^,  on 
4>c,  so  that  updating  and  forecasting  corresponds  to  calculating  parameters  for  posterior 
normal  distributions  rather  than  normal-gamma  distributions.  This  needs  to  be  carried 
out  in  order  to  calculate  the  parameters  of  the  distributions  of  (0^\<f>c,0c,u,  Dt)  for  all  t, 
from  which  we  will  obtain  the  desired  joint  posterior  distribution. 


The  normal  based  updating  and  forecasting  recursions  are  straightforward.  Suppose 
we  have 


(0W\<f>c,(3c,uc,Dt-i)  ~  N (m<‘),C<‘>). 

Then  upon  observing  new  data,  dt,  the  updating  recursion  obtains  the  distribution 

{6{t)\<t>c,Pc,Uc,Dt)  ~  N(mg,cg), 


where 

C'g  =  (C  P'1  +  Lt)~1)-1 

mg  =  cgfC^m^  +  EW-VW), 
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where  and  are  the  conditional  means  and  variances  for  the  t-th  competition 
conditional  on  0C  and  <pe. 

The  forecasting  recursion  which  obtains  the  distribution  of  parameters  before  the  t  + 1- 
th  competition  is  given  by 

(0(m)| <t>c,0c,ue,Dt)  ~  N(m<‘+1),C<‘+1)), 

where 

(t+i)  /(<) 

me  =  m« 

C#+1)  =  C'e(t)  +  <r%. 

These  two  recursions  are  performed  alternately  in  succession  to  obtain  the  posterior  dis¬ 
tribution  of  (0(T)|<£C,  /3c,uc,  Dt)- 

To  obtain  a  sample  of  draws  of  the  t  =  T  —  1, . . . ,  1,  we  first  draw  0^  from  the 
distribution  of  (0^\<f>e, 0e,uc, Dt)-  Then  for  t  =  T  —  1, ....  1,  we  successively  draw  from 

(0(‘W+1), •  ..,O?\<t>e,0,ue,DT)  ~  N(V(‘)(C;(t) +  u,0<‘+1>),  V<‘>), 

where  =  (Cg^  -1  +  u>Ip)~1.  The  result  of  this  process  is  a  single  draw  from  the 
distribution  of  (0^, . .  .,0(T\(3,4>\u,  Dt)- 

Drawing  from  the  distribution  of  (wl^1), . .  .,0(T),/?,  <t>,  Dt)  is  identical  to  (2.20)  in 
Section  2.5.  Only  the  p  dynamic  parameters  of  0^  are  used  to  compute  the  parameters 
of  the  gamma  distribution  because  the  conditional  distribution  of  u  does  not  depend  on 
the  non-dynamic  components. 

Non-iterative  analysis 

The  addition  of  non-dynamic  parameters  to  the  analysis  of  Section  2.5.2  presents  no 
obstacles.  Here  the  non-iterative  analysis  requires  the  computation  of  f{d.t\Dt~\,u),  for 
all  t,  in  order  to  obtain  the  posterior  distribution  of  u.  This  is  accomplished  by  computing 
the  normal-gamma  parameters  for  the  distribution  of  (0^,  <p\u,  D<_i)  using  the  updating 
and  forecasting  equations  (2.21)  and  (2.22),  and  calculating  the  values  of  the  t-densities 
given  these  parameter  values  for  each  v,  and  each  season  t.  No  other  modifications  are 
necessary  to  the  analysis  of  Section  2.5.2. 


2.7  Marginal  and  Forecast  Distributions 


In  practice,  we  are  primarily  interested  in  making  inferences  on  0^  and  on  This 

involves  marginalizing  the  distribution  of  (0^t\cj\Dt)  or  (y(T+1),w| Dt)  over  the  nuisance 
parameter  u>.  In  this  section,  we  describe  an  approach  using  the  results  of  Section  2.5  to 
marginalize  over  u ;. 
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2.7.1  Marginal  Distribution  of  ( o\Dt ) 


The  methods  of  Sections  2.5.1  and  2.5.2  provide  an  approximate  posterior  distribution  for 
(u\Dt),  and  therefore  for  {o\Dj),  as  a  discrete  distribution.  In  the  case  of  Section  2.5.1, 
the  Gibbs  sampler  results  in  a  random  sample  of  m  draws  from  the  posterior  distribution 
of  ( u\Dr ),  whereas  Section  2.5.2  results  in  a  discrete  probability  distribution  on  the  m 
values  of  u  which  approximates  the  true  continuous  distribution  oi  (w|2>r)-  In  both  cases, 
we  have  a  discrete  distribution  on  a  set  of  values  of  u>  which  approximates  the  continuous 
distribution  of  (u>\Dj).  We  assume,  therefore,  that 


[  t(T) 

f(u\DT)  =  |  J 


for  u  =  Ui 

if  w  ^  {u>i,...,wm} 


(2.23) 


When  {wi,...,wm}  are  drawn  from  the  Gibb.=  sampler,  we  assume  that  x{T*  =  1/m  for 
all  i.  The  approximate  mean  and  variance  of  {o\D?),  where  ax  =  l/v/w7,  are  given  by 


E(<r|Dr)  = 

/  af(a\DT)  da  = 

J  j= i 

VarHDr)  = 

J(a-E(a\DT))*f(a\DT)da 

= 

f;(^-E(a|I?r)Nr). 

»=i 

Higher  order  moments  can  be  calculated  analogously. 


2.7.2  Marginal  Distribution  of 

In  order  to  make  inferences  on  the  distribution  of  (0^T^|Dx),  we  need  to  marginalize  over 
u.  After  tournament  T, 

(0V\<t>lu,DT)  ~  NG(M/(T\£'(T),i^T),t>'(T)). 

Marginalizing  over  <p  gives 

(0<T>|u>.Dr)  ~ 

Therefore, 

/(«m| Dt)  -  J  f(OIT>\w,DT)  f(w\DT)dj 
«=1 

which  is  a  mixture  of  multivariate  t  densities. 
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The  mean  of  this  mixture  distribution  is  computed  as 
E(0<T)|  Dt)  =  E(E(0<t>|w,I>t)) 

=  £(„«•>)= 

«= i 

where  nP  is  the  mean  of  (d^T^\u,,  Dt)-  To  compute  the  variance,  we  first  note  that  for 
multivariate  t  distributions, 

Var(*<T>|Wi,DT)  =  ^r^^(T)^(T)_1- 

where  and  R.fT^  are  indexed  to  show  the  dependence  on  w, .  We  therefore  have 

Var(0(T>|Z>T) 

=  E(Var(0<r>|u;,  Dt))  +  Var(E(0(r)(w,  DT)) 

=  f]  xf)Var(^T)|u;„  Dr)  +  Var(#i{T)) 

«=1 

=  f;  xP )Var(^T>|Wi,  Dt)  +  (£*P(dP  ~  E(e^\DT))(^T]  -  E(0<T>|Z)r))T 

1=1  Vt=l 

(2.24) 

These  computations  allow  us  to  summarize  the  first  and  second  moments  of  the  marginal 
distribution  of  (0^\Dt)- 

Posterior  inferences  on  (0^|Z?t)  from  the  mixture  distribution  are  difficult  to  per¬ 
form  analytically.  Instead,  we  simulate  draws  from  the  mixture  distribution,  and  make 
inferences  empirically  from  the  obtained  sample.  To  simulate  draws  from  the  mixture,  we 
can  draw  from  (w| Dt)  first,  and  then  given  this  value  of  u  we  can  draw  from  the  exact 
multivariate  t  distribution  of  (^T^|w,  Dt)-  This  gives  us  a  single  draw  from  the  marginal 
distribution  of  (6^\Dt)-  To  draw  from  a  multivariate  t-distribution,  we  can  first  draw 
from  the  gamma  distribution  of  (<f> \u>,  Dt),  arid  then  draw  from  the  multivariate  normal 
distribution  of  (0^\<j>,u ,Dt)-  In  either  case,  this  process  is  repeated  until  we  obtain  a 
sample  large  enough  so  that  empirical  inferences,  such  as  credible  intervals  for  parameters, 
can  be  obtained. 


2.7.3  Forecast  Distribution  for  <1t+i 

We  now  show  how  to  obtain  forecast  means  and  variances,  and  prediction  intervals  for 
unobserved  data  dr+i  of  length  n^T+1\  We  first  note  that  from  (2.16)  in  Section  2.4, 

(dT+i\u,DT)  ~  T^T+1)(X(r+1^(r+1,,e(r+1)(I„(r+.)  +  xWrV+V-'xWT)) 

(2.25) 
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so  that  the  predictive  distribution  for  dr+ 1 

f(dT+i\Dr)  =  J  /(^t+i|w,Z>t)/(w|Z>t)  du 

=  f>rV(<*T+ik,l>r)  (2-26) 

t=i 

is  again  a  mixture  of  t  distributions.  The  mean  is  computed  as 

E(dT+i|T>T) 

=  E(E(dT+ik*>,Dr)) 

=  f>jT>x<™viT+1)  =  x(T+,)(f>PVr+1)), 

«=x  t=l 

where  X^T+1^  is  the  design  matrix  for  the  new  data,  and  is  the  mean  of  the  normal- 

gamma  distribution  that  forecasts  from  data  through  time  T.  For  computing  the 

variance,  we  first  note  that 

Var(dT+i  DT)  =  ^^eP*+l)(IrfT*i,  +  X<7>1>*P+1)  -,X<T+l>T) 

so  that 

Var(dT+i|Z)r) 

=  E(Var(dT+i|w,  DT))  +  Var(E(dT4-i|w,  Dt)) 

=  f;  xf(T)Vax(d(r+1)|u;i,  Dr)  +  Var(X'T+ VT+1)) 

i=i 

=  E  ^(T)Var(d(T+1)|Wt,  2?T)  +  X<74  Var(M{T+1))X<T+1> T 

1=1 

=  f;x|r)Var(d(r+1)|Wi,i)T) 

•=i 

+A-(r+l)  ^  TF)(MF+1)  _  E(0(T)|jDr))(M(T+l)  _  Jf  (T+l)  T 

(2.27) 

Thus  calculations  for  the  estimated  mean  and  variance  are  straightforward  from  these 
formulas. 

We  obtain  inferences  from  the  predictive  distribution  in  (2.26)  by  drawing  random 
samples  of  and  reporting  empirically  based  predictive  intervals.  We  draw  from 

(w| Dt),  and  then  given  this  value  we  can  draw  from  the  exact  multivariate  t  distribution 
of  f(dr+i\u,  Dt)  as  given  in  (2.25).  This  results  in  a  single  draw  from  the  marginal 
distribution.  Repeated  applications  of  this  process  yield  a  sample  on  which  to  perform 
empirically  based  inferences. 
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It  should  be  noted  that  treating  w,  and  therefore  <r2,  as  an  unknown  parameter  ap¬ 
propriately  increases  the  estimated  variability  in  both  (0^|Z)r)  and  (dj+]|.Dj-).  This 
is  reflected  by  the  inclusion  in  (2.24)  and  (2.27)  of  the  second  term,  the  variance  of  the 
conditional  mean.  If  u  were  treated  as  known,  this  second  term  would  vanish,  and  the 
variability  of  the  forecasts  would  be  underestimated. 


2.8  Sequential  Updating  with  New  Data 


In  this  section  we  consider  the  situation  in  which  we  have  observed  T  tournaments  of 
paired  comparison  data  and  have  performed  one  of  the  anakses  of  this  chapter  thereby 
obtaining  the  distribution  of  <j>\u,  Dt)  and  the  distribution  of  (u\Dt)-  We  now 
observe  data  at  tournament  T  +  1  and  want  updated  distributions  of  (0^T+1\  <t>\ui,  Dj+i) 
and  {u\Dt+{). 

Naturally,  one  method  that  gives  the  desired  result  is  to  apply  the  Gibbs  sampler 
methodology  of  Section  2.5.1  by  performing  the  entire  Gibbs  sampler  analysis  on  all 
T  +  1  tournaments.  This  will  give  the  desired  posterior  distributions,  but  it  is  extremely 
inefficient  in  that  it  does  not  make  use  of  the  analysis  of  the  first  T  tournaments. 

A  more  efficient  approach  is  based  on  a  non-iterative  updating  algorithm  similar  to 
the  method  of  Section  2.5.2.  In  the  analysis  of  Section  2.5.2,  we  update  the  distribution  of 
(w|D0)  to  (u)\Dt)  by  reweighting  the  initial  probabilities,  x,-0*,  by  the  product  of  marginal 
likelihoods  of  the  data  to  obtain  the  posterior  probabilities,  x-T*.  The  analysis  here  is 
similar,  as  we  want  to  update  the  distribution  of  (v\Dt)  to  (u>\Dt+i)  from  data  observed 
at  tournament  T  +  1. 


We  have  at  time  T 

(i 9(t),4>\u,Dt )  ~  NG(M,(T),£'(r),  J^T\t/(T)), 

and  we  also  have  a  sample  of  m  values,  u\, . . ,  ,wm,  for  which 


/MDt) 


-  /  *!T1 1  k 

\  0  if 


Of  =  <i>; 

m} 


Obtaining  the  distribution  of  (0^r+1\<£|u;,  l?r+i)  is  a  simple  application  of  the  forecasting 
and  updating  formulas  given  in  Section  2.4.  To  calculate  the  posterior  probabilities, 
xf+1>,  for  the  distribution  of  (u\Dt+i),  we  have 


/(w|Dt+i)  = 


f(dT+i\Dr,u) 


f{u,\DT) 


f(dr+i\DT ) 

«  f(dT+i\DT,u)  f(v\DT). 

The  posterior  distribution  of  (u;|Z?x+i)  is  given  by 


/(w|Z?T+1)  =  (  ir«fT+1>  ifw  =  a,‘ 

JK  '  +  >  \  0  ifw*  ,u;m} 
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SO 


with 


t(T+i>  _  *iT)f{dT+i\ui,DT) 

*'  ~Er=i*r/(rfT+iki,DT)' 


Thus  the  updating  of  the  discrete  posterior  distribution  from  to  r{T+1*  is  obtained 
by  reweighting  by  the  relative  likelihoods  of  the  new  data,  which  have  a  multivariate  t 
density.  This  approach  appears  to  be  an  efficient  way  to  obtain  the  approximate  posterior 
distribution  of  (o>|.Dy+i)  as  it  makes  use  of  the  previous  computation  that  resulted  in 
the  posterior  distribution  of  (w|Dj).  The  methods  of  Section  2.7  can  be  used  to  obtain 
the  marginal  posterior  distribution  of  (^t+1^|Z)7>+i)  as  well  as  forecast  distributions  for 
subsequent  tournaments. 


Chapter  3 


Analysis  of  NFL  Football  Game 
Scores 


Using  the  methodology  described  in  Chapter  2,  we  analyze  in  thi6  chapter  National  Foot¬ 
ball  League  (NFL)  game  outcomes  from  1981-1991.  As  a  result  of  our  analysis,  we  obtain 
rankings  of  teams  at  the  end  of  the  1991  season  and  game  predictions  for  the  1992  season. 
In  Section  3.1  we  describe  a  probability  model  for  football  scores  that  incorporates  changes 
in  team  abilities  over  time,  and  the  implementation  of  both  the  Gibbs  sampler  analysis 
and  non-iterative  analysis  of  the  data  under  the  model.  Section  3.2  reports  the  results 
of  the  analyses  of  the  model.  We  obtain  the  posterior  distribution  of  the  between-season 
variance,  the  parameter  describing  the  magnitude  of  the  team  ability  changes  from  season 
to  season.  We  also  compute  posterior  distribution  summaries  and  inferences  for  team 
ratings,  and  obtain  prediction  intervals  for  1992  NFL  games.  To  test  the  adequacy  of  the 
model,  we  perform  diagnostic  checks  on  the  forecast  residuals.  In  Section  3.3,  we  compare 
the  results  of  analyses  based  on  different  amounts  of  data.  We  demonstrate  in  Section  3.4 
how  to  update  the  distribution  of  the  model  parameters  to  account  for  new  data  (the  first 
four  weeks  of  1992  NFL  game  results).  We  conclude  the  chapter  in  Section  3.5  with  a 
discussion  of  the  utility  of  the  model. 

The  data  used  in  the  analysis  consists  of  the  complete  regular  season  results  of  NFL 
football  games  for  the  years  1981,  1983-1986,  1988-1991.  We  did  not  include  games  for 
the  years  1982  and  1987  because  these  were  years  in  which  strikes  occurred;  games  played 
during  these  years  may  not  be  representative  of  how  teams  would  perform  under  usual 
conditions.  The  NFL  is  comprised  of  28  teams,  and  during  the  regular  season  each  team 
plays  16  games,  resulting  in  a  total  of  224  games  played  per  season.  For  each  game, 
we  recorded  the  final  score  for  each  team,  and  the  site  of  the  game.  Use  of  covariate 
information  could  likely  improve  the  predictive  power  of  the  model  (game  statistics,  for 
example),  but  no  additional  information  was  recorded. 
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3.1  A  Model  for  NFL  Score  Differences 


The  probability  model  that  we  assume  for  differences  in  final  scores  during  season  t, 
t  =  1981, . . 1991,  is  given  by 


where,  at  time  t, 


6, 


>(0 


6, 


(hfa) 

'k,jk 

x(‘) 


=  the  score  difference  of  game  k 

=  rating  parameter  for  team  i 
—  home  held  advantage  parameter 
=  indices  of  teams  involved  in  the  fc-th  game 

|  1  if  team  it  plays  on  their  home  field 

~  |  — 1  if  team  jk  plays  on  their  home  field 

=  variance  of  conditional  on  the  mean 


(3.1) 


The  team  parameters,  i  =  1,...,28,  indicate  the  relative  scoring  abilities  of  the 
teams  daring  season  t.  The  difference  is  the  score  difference  expected  to  occur 

between  teams  t*  and  j*  daring  season  t,  not  accoonting  for  the  sr  :■  of  the  game.  The 
parameter  the  home-field  advantage  (HFA)  parameter,  specifies  the  amount  on 

average  the  home  team  will  outscore  the  visiting  team,  and  is  assumed  constant  for  all 
teams  and  seasons.  The  variance  of  the  score  difference  about  the  mean,  r2,  is  also 
assumed  independent  of  teams  involved  in  a  game  and  of  season. 


The  model  in  (3.1)  is  intended  only  as  an  approximation.  NFL  games  scores  can  take 
on  only  integer  values,  so  the  assumption  of  normality  conditional  on  the  parameter  values 
cannot  be  considered  exact.  Modeling  football  game  score  differences  as  discrete  outcomes 
may  be  more  realistic,  but  the  analysis  of  such  a  model  may  prove  to  be  intractable.  An 
example  of  such  an  analysis  appears  in  Rosner  (1976).  The  normal  approximation  allows 
for  the  use  of  the  methodology  developed  in  Chapter  2. 


For  the  NFL  data,  we  assume  a  model  fo:  the  evolution  of  the  parameters  given  by 

0(0  =  0(t-i)  +  „(0, 

where  8^  and  denote  the  vectors  of  28  team  parameters  for  seasons  t  and  t  —  1, 

respectively,  and  i*W  is  the  amount  of  change  incurred  in  the  parameter  values  between 
seasons  t  -  1  and  t.  We  assume  for  the  analysis  of  NFL  game  scores  that 

i/<‘>~N(0,<r2I). 

The  model  assumes  that  team  abilities  change  independently  from  season  to  season,  the 
expected  change  being  0,  and  the  variance  constant  for  all  teams.  The  HFA  parameter  is 
assumed  to  remain  constant  over  time. 
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The  model  we  choose  for  the  evolution  of  team  parameters  suggests  that  teams  with 
low  abilities  undergo  roughly  the  same  amount  of  change  in  ability  from  season  t  -  1  to 
t  as  teams  with  higher  abilities.  This  assumption  is  probably  not  quite  right  because 
poorly  performing  teams  at  the  end  of  season  t  —  1  have  greater  chance  for  improvement, 
for  example,  by  obtaining  better  players  in  the  player  selection  draft  before  the  following 
season.  Thus  the  mean  change  from  season  to  season  may  be  a  function  of  the  team 
parameter  values  or  records  of  the  previous  season.  Furthermore,  the  mean  change  from 
season  to  season  may  be  a  function  of  other  covariates  that  are  difficult  to  quantify,  such 
as  player  trades  and  coaching  staff  changes.  The  model  as  postulated  also  assumes  that 
teams  do  not  undergo  ability  changes  during  a  season,  but  instead  only  change  between 
seasons.  While  team  abilities  do  change  during  a  season,  the  magnitude  of  the  change  will 
likely  be  small  relative  to  the  innovations  occurring  between  seasons,  so  the  simplifying 
assumption  of  constant  ability  within  a  season  is  not  unreasonable. 


3.1.1  Prior  Parameters 

We  use  a  vague  prior  distribution  on  the  parameters  to  reflect  our  initial  uncertainty.  This 
choice  allows  the  likelihood  of  the  data  to  determine  the  distribution  of  the  parameters. 
The  normal-gamma  parameters  for  the  distribution  of  (^1981\®(hfa))  31-6 

M(1981)  =  (0, . .  .,0,3) 

£<1981)  _  10Q 

J*<1981>  = 
v<1981>  =  .5. 

The  prior  means  for  the  team  parameters  are  all  zero,  that  is,  no  team  is  assumed  a 
priori  better  than  another,  and  the  mean  of  the  HFA  parameter  equivalent  to  a  3  point 
advantage  for  the  home  team.  The  variance  of  the  prior  means  is  assumed  to  be  100,  which 
corresponds  to  a  large  amount  of  uncertainty.  The  initial  uncertainty  of  the  observation 
variance  estimate  of  100  is  reflected  by  only  .5  degrees  of  freedom.  The  harmonic  mean 
variance  of  a  score  difference  is  assumed  to  be  100. 

We  assumed  a  vague  prior  for  the  distribution  of  w  =  l/<72  with  improper  density 

'(">  -  h- 

which  corresponds  to  a  prior  distribution  of  l/<r  on  a.  This  choice  gives  greater  weight  a 
priori  to  smaller  values  of  <r2.  This  prior  distribution  on  u  is  sufficiently  vague  to  allow 
the  likelihood  to  strongly  dominate. 
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з. 1.2  Model  Implementation 

In  this  section,  we  describe  the  implementation  of  the  iterative  and  non-iterative  analyses. 
The  model  accounts  for  the  lack  of  1982  and  1987  NFL  data  by  assuming  the  parameter 
means  do  not  change  for  these  years,  while  the  team  parameter  variances  increase  by 
a2.  When  performing  the  recursive  iterations  of  updating  and  forecasting  as  described 
in  Section  2.4,  the  updating  computation  is  omitted  for  1982  and  1987.  Otherwise,  the 
computation  proceeds  as  usual. 

Gibbs  Sampler  Implementation 

We  performed  an  analysis  of  the  model  for  NFL  data  using  the  Gibbs  sampler  tech¬ 
niques  of  Section  2.5.1  with  the  extension  for  non-dynamic  covariate  information  described 
in  Section  2.6.  We  ran  three  parallel  Gibbs  samplers  with  overdispersed  starting  values 
of  u>:  1/1002,  1/52,  and  1/.22.  Each  Gibbs  sampler  proceeds  by  drawing  the  parameters 
0(hfa),0^1981\...,0^1991\r2,  given  the  initial  value  of  w,  then  drawing  u>  conditional  on 
these  parameter  values,  and  so  on.  This  was  continued  for  450  iterations  for  each  sampler. 
Convergence  of  the  algorithm  was  checked  by  examining  the  potential  scale  reduction  of 

и,  as  described  in  Gelrran  and  Rubin  (1992a).  The  potential  scale  reduction  for  u>  is 
an  estimate  of  the  factor  by  which  the  variance  of  the  current  distribution  of  u>  in  the 
Gibbs  sampler  will  decrease  with  continued  iterations.  Values  of  the  potential  scale  re¬ 
duction  near  1  are  indicative  of  convergence.  For  the  NFL  model,  it  suffices  to  examine 
the  potential  scale  reduction  for  the  parameter  u>,  because  once  the  Gibbs  sampler  has 
reached  the  stationary  distribution  for  w,  the  distribution  of  the  remaining  parameters  is 
completely  specified.  Transforming  u  by  taking  logarithms  to  make  the  distribution  less 
skewed,  the  potential  scale  reduction  for  u  is  computed  to  be  1.027.  This  value  indicates 
that  continued  iterative  simulation  would  not  lead  to  improved  inferences.  We  therefore 
conclude  450  iterations  is  sufficient  to  reach  the  target  distribution.  Each  of  the  samplers 
were  continued  50  more  iterations  to  produce  150  more  draws  of  u,  and  then  a  random 
subsample  of  100  of  these  were  chosen  to  be  the  final  sample  drawn  from  the  marginal 
posterior  distribution  of  w.  The  sample  values  are  then  reexpressed  as  values  from  the 
posterior  distribution  <r  by  setting  a  =  l/y/u. 

Non-iterative  Analysis 

We  also  analyzed  the  model  using  the  non-iterative  methods  of  Section  2.5.2,  allowing 
for  the  inclusion  of  a  non-dynamic  covariate  as  described  in  Section  2.6.  We  approximated 
the  prior  distribution  of  a  to  be 

f{<?)  «  p 

where  a  has  been  discretized  and  takes  only  the  20  equally  spaced  values  (2.0, 2.16, ... ,  5.0). 
This  set  of  values  appears  to  be  overdispersed  for  the  posterior  distribution  of  <r,  as  in¬ 
dicated  by  the  Gibbs  sampler  analysis.  In  the  absence  of  such  information,  a  wider 
range  of  values  would  be  required.  Thus  we  have  comparable  prior  distributions  for  the 
Gibbs  sampler  and  non-iterative  analyses.  We  obtain  the  approximate  posterior  distri- 
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butioo  of  a  using  the  techniques  of  Section  2.5.2  by  computing  the  reweighting  factors, 

/(y(1992)K,  D1991)  •  •  •/(y(1981)l‘*'„Dj98o)- 

3.2  Results 

We  discuss  in  this  section  the  results  of  the  iterative  simulation  analysis  and  the  non¬ 
iterative  analysis.  We  examine  the  marginal  posterior  distributions  of  a  and  of  0(199l) 
along  with  0(hfa)»  and  compute  forecast  distributions  of  1992  results.  We  also  perform 
model  diagnostics  to  assess  the  validity  of  the  model. 


3.2.1  Marginal  Posterior  Distribution  of  a 

Figure  3.1  displays  the  distribution  of  a,  the  system  standard  deviation,  for  both  the 
iterative  simulation  analysis  and  the  non-iterative  analysis.  The  top  histogram  shows  the 
posterior  distribution  based  on  the  sample  of  100  draws  from  the  Gibbs  sampler,  while 
the  bottom  histogram  shows  the  posterior  distribution  calculated  from  the  grid  of  20 
equally  spaced  values.  The  similarity  of  the  distributions  indicates  general  agreement  of 
the  methodologies  in  producing  comparable  results.  As  we  show  in  subsequent  analyses, 
the  slight  difference  in  the  approximated  distributions  for  a  does  not  appear  to  have  large 
effect  on  the  marginal  distribution  of  the  team  parameters  or  on  predictive  distributions. 

The  estimated  posterior  mean  and  standard  deviation  of  tr,  the  system  standard  devi¬ 
ation  of  the  model,  calculated  from  the  iterative  analysis  are  3.24  and  0.304,  respectively. 
From  the  non-iterative  analysis,  the  mean  is  estimated  to  be  3.16  and  the  standard  de¬ 
viation  is  0.307.  Thus  the  results  are  consistent.  An  interpretation  of  this  estimate  is 
that,  between  seasons,  changes  in  teams’  scoring  capabilities  relative  to  other  teams  have 
a  standard  deviation  of  about  3  points.  Teams  that  have  ratings  within  a  few  points  of 
one  another  at  the  end  of  one  season  are  not  unlikely  to  switch  positions  in  the  following 
season. 


3.2.2  Marginal  Posterior  Distribution  of  Ratings  and  HFA  parameters 

To  obtain  the  marginal  posterior  distribution  of  0*1991'  and  0(hfa),  we  use^  the  techniques 
discussed  in  Section  2.7.  Marginal  posterior  means  and  standard  errors  of  0*1991)  are  with 
TTj  =  1/100  for  all  i  —  1, . . .,  100  in  the  Gibbs  sampler  analysis,  and  with  x,  equal  to  the 
posterior  probabilities  given  in  the  bottom  histogram  in  Figure  3.1.  We  also  compute 
estimated  95%  credible  intervals  by  drawing  3000  samples  from  the  posterior  distribution 
of  0<1991)  and  reporting  the  approximate  2.5%  and  97.5%  percentiles  for  each  parameter. 
These  values  for  the  Gibbs  sampler  analysis  are  summarized  in  Table  3.1  and  for  the 
non-iterative  analysis  in  Table  3.2. 
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Parameter 

Mean 

Std  Dev 

95%  Credible  Interval 

Washington  Redskins 

11.45 

4.04 

(  3.81, 

19.54) 

San  Francisco  49ers 

9.47 

4.01 

(  1-01, 

17.47) 

Houston  Oilers 

6.15 

4.02 

(  -1-79, 

14.08) 

New  Orleans  Saints 

5.85 

4.02 

(  -2.04, 

13.70) 

Buffalo  Bills 

5.66 

4.02 

(  -2.34, 

13.59) 

Kansas  City  Chiefs 

5.29 

4.02 

(  -2.88, 

13.26) 

Philadelphia  Eagles 

■S3 

(  -3.88, 

12.05) 

New  York  Giants 

2.57 

4.02 

(  -5.55, 

10.45) 

Denver  Broncos 

2.47 

(  "5.47, 

10.15) 

Los  Angeles  Raiders 

2.24 

4.02 

(  -5-74, 

10.15) 

Chicago  Bears 

2.01 

4.02 

(  -6.16, 

10.17) 

Seattle  Seahawks 

1.07 

4.02 

(  -6.58, 

9.03) 

Atlanta  Falcons 

0.52 

4.02 

(  -7.35, 

8.61) 

Detroit  Lions 

Km 

(  -7-80, 

8.07) 

Dallas  Cowboys 

0.08 

4.03 

(  -7.83, 

8.18) 

Minnesota  Vikings 

0.02 

4.03 

(  -7-97, 

8.43) 

Miami  Dolphins 

-1.28 

4.02 

(  -9.43, 

6.61) 

San  Diego  Chargers 

-1.51 

4.02 

(  -9.34, 

6.69) 

Pittsburgh  Steelers 

-1.82 

4.01 

(  "9.77, 

6.46) 

Cleveland  Browns 

-2.84 

4.01 

(-10.63, 

5.30) 

New  York  Jets 

-2.87 

4.02 

(-10.85, 

5.42) 

Green  Bay  Packers 

-3.77 

4.02 

(-11.59, 

4.06) 

Los  Angeles  Rams 

-4.27 

4.02 

(-12.42, 

3.47) 

Cincinnati  Bengals 

-4.89 

4.03 

(-12.97, 

2.80) 

Phoenix  Cardinals 

-6.59 

4.02 

(-14.59, 

1.46) 

Tampa  Bay  Buccaneers 

-8.88 

4.02 

(-16.82, 

-0.89) 

New  England  Patriots 

-8.96 

(-16.86, 

-0.64) 

Indianapolis  Colts 

-11.29 

4.05 

(-19.13, 

-3.40) 

Home  Field  Advantage 

2.96 

0.29 

(  2.41, 

3.53) 

Table  3.1:  retribution  of  0(1991)  from  Gibbs  Sampler  Analysis 
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Parameter 

Mean 

Std  Dev 

95%  Credible  Interval 

Washington  Redskins 

11.33 

4.02 

(  3.41, 

19.38) 

San  Francisco  49ers 

9.45 

3.98 

(  1.61, 

17.19) 

Houston  Oilers 

6.07 

(  -1.89, 

13.87) 

New  Orleans  Saints 

5.79 

3.99 

(  -2.21, 

13.69) 

Buffalo  Bills 

5.66 

3.99 

(  -2.34, 

13.72) 

Kansas  City  Chiefs 

5.25 

4.00 

(  -2.51, 

13.31) 

Philadelphia  Eagles 

4.01 

3.99 

(  -3.64, 

12.11) 

New  York  Giants 

2.61 

(  -5.34, 

10.57) 

Denver  Broncos 

2.46 

3.99 

(  -5.53, 

10.29) 

Los  Angeles  Raiders 

2.24 

3.99 

(  -5.69, 

9.98) 

Chicago  Bears 

2.02 

3.99 

(  -5.99, 

9.65) 

Seattle  Seahawks 

1.06 

3.99 

(  -6.43, 

9.24) 

Atlanta  Falcons 

0.45 

3.99 

(  -7.39, 

8.41) 

Minnesota  Vikings 

0.08 

4.00 

(  -7-94, 

7.97) 

Detroit  Lions 

0.06 

3.99 

(  -7.99, 

7.56) 

Dallas  Cowboys 

IHESSl 

4.01 

(  -7-91, 

7.99) 

Miami  Dolphins 

-1.25 

3.99 

(  -9.18, 

6.54) 

San  Diego  Chargers 

IEJ&U 

3.99 

(  -9.20, 

6.78) 

Pittsburgh  Steelers 

-1.80 

3.98 

(  -9.95, 

6.29) 

Cleveland  Browns 

-2.84 

3.98 

(-11.01, 

4.87) 

New  York  Jets 

-2.89 

3.99 

(-10.67, 

4.85) 

Green  Bay  Packers 

-3.76 

3.99 

(-11.92, 

3.89) 

Los  Angeles  Rams 

-4.19 

(-11.95, 

3.80) 

Cincinnati  Bengals 

-4.78 

4.01 

(-12.74, 

3.57) 

Phoenix  Cardinals 

-6.58 

3.99 

(-14.37, 

1.35) 

Tampa  Bay  Buccaneers 

-8.85 

3.99 

(-16.73, 

-1.06) 

New  England  Patriots 

-8.93 

3.99 

(-17.18, 

-1.22) 

Indianapolis  Colts 

-11.17 

4.03 

(-19.68, 

-3.43) 

Home  Field  Advantage 

2.96 

0.29 

(  2.38, 

3.53) 

Table  3.2:  Distribution  of  from  Non-Iterative  Analysis 
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Distribution  of  a  from  Gibbs  Sampler 


25  3J)  3-5  4J>  45  5  JD 

Values  of  tr 


Figure  3.1:  Posterior  Distribution  of  a 

Tables  3.1  and  3.2  rank  the  teams  in  order  of  the  posterior  mean  of  0(1991K  Both 
analyses  produce  the  same  rankings,  with  the  exception  of  the  Vikings,  the  Lions  and  the 
Cowboys.  In  the  Gibbs  sampler  analysis,  the  Vikings  are  ranked  below  the  Lions  and 
the  Cowboys,  while  in  the  non-iterative  analysis  the  Vikings  are  ranked  higher.  In  both 
analyses,  these  three  teams  have  posterior  means  estimated  to  be  within  0.1  points  of 
each  other,  the  results  still  appear  consistent  in  spite  of  the  reordering  of  the  teams.  The 
means  of  the  team  parameters  for  each  analysis  are  within  0.2  points  of  each  other  and 
the  standard  deviations  match  even  more  closely.  The  standard  deviations  in  the  Gibbs 
sampler  analysis  are  consistently  larger  than  those  in  the  non-iterative  analysis,  but  size 
of  the  difference  suggests  that  the  cause  may  be  attributed  extreme  values  of  the  posterior 
distribution  not  being  represented  in  the  small  sample  obtained  in  the  Gibbs  sampler. 

The  means  can  be  best  interpreted  as  differences.  For  example,  from  Table  3.1,  the 
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value  of  the  means  suggest  that  we  would  expect  at  the  end  of  the  1991  season  the  Redskins 
to  beat  the  49ers  by  about  2  points,  excluding  the  effect  of  home  field.  A  more  extreme 
example  would  be  that  we  would  expect  the  Redskins  to  outscore  the  Colts  by  about  23 
points  on  average. 

The  standard  deviations  of  the  rating  parameters  are  close  to  4,  indicating  substantial 
variability  in  the  parameter  distributions.  Posterior  correlations  among  the  rating  param¬ 
eters  are  about  0.6,  so  that  the  standard  deviation  for  differences  in  rating  parameters  is 
close  to  3.7.  The  data  therefore  provide  more  evidence  about  differences  in  parameters 
than  the  individual  team  ratings. 

The  HFA  parameter  has  a  posterior  mean  of  about  2.96  with  a  small  standard  error. 
This  can  be  interpreted  as  giving  the  home  team  approximately  a  3-point  advantage.  The 
standard  error  for  the  HFA  parameter  is  small  because  it  16  assumed  to  remain  constant 
over  time,  and  all  of  the  data  is  used  to  estimate  the  posterior  distribution  of  0(hfa)’  so 
that  unlike  the  team  parameters  there  is  no  extra  uncertainty  incurred  over  time. 


3.2.3  Forecast  Distribution  of  Score  Differences 

To  illustrate  the  techniques  described  in  Section  2.7,  we  obtain  the  forecast  distribution 
of  selected  game  outcomes  for  the  1992  season.  For  the  first  20  games  of  the  season,  we 
computed  the  predicted  score  differences  and  standard  errors.  In  addition,  we  simulated 
3000  draws  from  the  approximate  predictive  distribution  of  y(1992)  to  obtain  50%  predic¬ 
tion  intervals  for  game  score  differences,  and  estimates  for  the  probability  of  the  first  team 
in  a  pair  winning.  Winning  probabilities  were  computed  by  counting  the  fraction  of  the 
3000  simulated  game  outcomes  that  produced  score  differences  greater  than  0. 

Tables  3.3  and  3.4  6how  the  resulting  analysis.  Again,  the  comparison  between  the 
two  analyses  lead  to  very  similar  results.  For  all  20  games,  the  predicted  score  results 
are  within  0.2  between  analyses.  The  50%  prediction  intervals,  which  are  produced  by 
simulation,  match  reasonably  closely  suggesting  that  the  different  distributions  of  a  do 
not  have  a  large  impact  on  the  resulting  predictions. 

It  is  worth  noting  that  the  50%  prediction  intervals  are  quite  wide.  We  also  computed 
95%  intervals  (not  shown)  which  spanned  about  eight  touchdowns.  Despite  the  widths,  the 
intervals  appear  to  be  honest,  as  8  observed  score  differences  fall  in  the  intervals  for  both 
the  Gibbs  sampler  analysis  and  for  the  non-iterative  analysis.  The  probabilities  computed 
from  simulations  of  the  predictive  distribution  do  not  seem  to  predict  substantially  better 
than  guessing  at  random  in  this  small  sample;  from  the  non-iterative  analysis,  the  log- 
likelihood  (base  10)  for  the  predictive  probabilities  is  —5.91,  whereas  the  corresponding 
log-likelihood  of  guessing  at  random  is  log^20)  =  -6.02.  A  more  thorough  analysis  of 
this  model  indicates  that  the  model  predictions  are  significantly  better  than  chance,  and 
better  than  competing  models  (Stern  1992a). 
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1992 

Games 

Predicted 
Score  Difference 

Actual 
Score  Difference 

50%  Prediction 
Interval 

Probability 

Win 

PHA 

vs. 

NO 

1.14 

2 

(  -7.88, 

11.65) 

0.55 

BUF 

vs. 

LAN 

12.89 

33 

(  3.59, 

23.14) 

0.82 

CHI 

vs. 

DET 

4.86 

3 

(  -4.24, 

14.14) 

0.64 

ATL 

vs. 

NYJ 

6.35 

3 

(  -2.54, 

16.14) 

0.69 

HOU 

vs. 

PIT 

10.93 

-5 

(  1-43, 

20.20) 

0.78 

MIN 

vs. 

GB 

0.84 

3 

(  -8.70, 

10.24) 

0.51 

CLE 

vs. 

IND 

5.49 

-11 

(  -3.86, 

14.93) 

0.65 

SF 

vs. 

NYG 

3.94 

17 

(  -5.50, 

14.16) 

0.61 

KC 

vs. 

SD 

3.84 

14 

(  -6.12, 

13.20) 

0.60 

SEA 

vs. 

CIN 

8.92 

.  _ -18 

(  -0.35, 

18.48) 

0.74 

TB 

vs. 

PHX 

0.66 

16 

(  -9-11, 

10.68) 

0.52 

DEN 

vs. 

LAA 

3.19 

4 

(  -6.38, 

12.41) 

0.60 

WAS 

vs. 

DAL 

8.40 

-13 

(  -1-24, 

18.26) 

0.72 

LAN 

vs. 

NE 

7.65 

14 

(  -2.54, 

17.15) 

0.69 

WAS 

vs. 

ATL 

13.89 

7 

(  4.46, 

23.19) 

0.84 

DAL 

vs. 

NYG 

-5.45 

6 

WBssm 

4.06) 

0.34 

NO 

vs. 

CHI 

6.80 

22 

(  -2.46, 

16.86) 

0.70 

LAA 

vs. 

CIN 

4.17 

-3 

(  -5.54, 

13.48) 

0.62 

DET 

vs. 

MIN 

3.04 

14 

(  -6.71, 

12.78) 

0.59 

KC 

vs. 

SEA 

7.18 

19 

(  -2.09, 

16.97) 

0.70 

Table  3.3:  Forecasts  of  1992  Games  from  Gibbs  Sampler  Analysis 
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1992 

Games 

Predicted 
Score  Difference 

Actual 
Score  Difference 

50%  Prediction 
Interval 

Probability 

Win 

PHA 

vs. 

NO 

1.19 

2 

(  -7.89, 

10.68) 

0.53 

BUF 

vs. 

LAN 

12.81 

33 

(  3.12, 

22.03) 

0.81 

CHI 

vs. 

DET 

4.92 

3 

(  -L98, 

14.78) 

0.64 

ATL 

vs. 

NYJ 

6.30 

3 

(  -3.16, 

15.95) 

0.68 

HOU 

vs. 

PIT 

10.84 

-5 

(  1-78, 

20.69) 

0.78 

MIN 

vs. 

GB 

0.88 

3 

(  -8.81, 

9.64) 

0.51 

CLE 

vs. 

IND 

5.36 

-11 

(  -3.94, 

15.14) 

0.65 

SF 

vs. 

NYG 

3.87 

17 

(  -5.82, 

13.37) 

0.60 

KC 

vs. 

SD 

3.79 

14 

(  -5.79, 

13.23) 

0.61 

SEA 

vs. 

CIN 

8.80 

-18 

(  -0.97, 

18.24) 

0.73 

TB 

vs. 

PHX 

0.70 

16 

(  -9.18, 

9.87) 

0.51 

DEN 

vs. 

LAA 

3.19 

4 

(  -6.20, 

12.84) 

0.59 

WAS 

vs. 

DAL 

8.37 

-13 

(  -1-84, 

17.87) 

0.72 

LAN 

vs. 

NE 

7.70 

14 

(  -1-81, 

17.60) 

0.70 

WAS 

vs. 

ATL 

13.84 

7 

(  3.69, 

23.45) 

0.82 

DAL 

vs. 

NYG 

-5.58 

6 

(-14.94, 

0.35 

NO 

vs. 

CHI 

6.72 

22 

(  -2.56, 

16.86) 

0.68 

LAA 

vs. 

CIN 

4.06 

-3 

(  -5.71, 

13.96) 

0.61 

DET 

vs. 

MIN 

2.95 

14 

(  -6.50, 

12.28) 

0.58 

KC 

vs. 

SEA 

7.15 

19 

(  -2.57, 

16.80) 

0.70 

Table  3.4:  Forecasts  of  1992  Games  from  Non-Iterative  Analysis 
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Values  of  r 


Figure  3.2:  Distribution  of  r  from  Non-Iterative  Analysis 
3.2.4  Distribution  of  the  Observation  Variance 

The  distribution  of  r,  the  observation  standard  deviation,  can  be  found  from  the  non¬ 
iterative  analysis  by  drawing  from  the  posterior  distribution  of  u;,  and  then  subsequently 
drawing  <f>  —  I/r2,  the  observation  precision,  from  the  gamma  distribution  conditional  on 
w.  We  simulated  10000  draws  from  the  posterior  distribution  of  r  and  show  the  resulting 
distribution  in  Figure  3.2.  The  distribution  of  r  appears  to  be  symmetric  with  mean  and 
standard  deviation  are  13.0  and  0.218  points,  respectively.  With  such  a  small  standard 
deviation,  the  parameter  r  is  well  estimated. 
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Figure  3.3:  Residuals  versus  Absolute  Predicted  Score  Differences 
3.2.5  Diagnostics 

We  examined  the  forecast  residuals  of  games  played  in  the  1992  season  using  the  results 
from  the  non-iterative  analysis.  Figure  3.3  shows  a  plot  of  residuals  against  absolute  values 
of  predicted  score  differences  for  the  first  181  games  of  the  1992  season  (those  available  at 
the  time  of  the  analysis).  The  ordering  of  teams  is  arbitrary,  so  the  sign  of  the  residuals 
are  not  meaningful.  The  plot  shows  random  scatter  with  no  clearly  emerging  pattern. 
The  plot  suggests  the  assumption  of  constant  variance  with  respect  to  predicted  values 
seems  appropriate.  A  plot  of  the  absolute  residuals  against  the  sum  of  team  scores  for 
the  181  games  is  shown  in  Figure  3.4.  We  might  expect  that  the  residuals  would  increase 
in  variability  with  larger  sums  of  scores  but  Figure  3.4  shows  no  substantial  increase  in 
variability  in  the  residuals. 
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Stun  of  Team  Scores 


Figure  3.4:  Forecast  Residuals  against  Sum  of  Scores 


We  checked  the  assumption  of  normality  by  examining  the  absolute  value  of  residuals 
from  the  1991  game  results.  A  half-normal  probability  plot  of  the  residuals  is  shown 
in  Figure  3.5.  From  this  plot,  the  residuals  appear  to  be  slightly  light-tailed,  although 
it  is  difficult  to  argue  that  the  distribution  differs  for  practical  purposes  from  a  normal 
distribution. 


3.3  The  Effect  of  Additional  Seasons 


We  compare  the  variability  of  the  parameters  when  different  amounts  of  data  are  incor¬ 
porated  into  the  analysis.  We  consider  three  analyses;  the  first  using  data  through  1984, 
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Positive  Quantiles  of  N(0,1) 


Figure  3.5:  Normal  Probability  Plot  of  1991  Residuals 

the  second  using  data  through  1988,  and  finally  the  complete  analysis  using  data  through 
1991.  Figure  3.6  shows  the  distribution  of  a,  the  innovation  standard  deviation,  computed 
from  the  non-iterative  analysis  in  each  case.  The  figure  shows  dearly  that  the  distribu¬ 
tion  of  a  becomes  less  variable  as  a  greater  amount  of  data  is  used  in  the  analysis.  The 
approximate  means  and  standard  deviations,  respectively,  for  each  of  the  distributions  are 
3.56  and  0.601  for  data  through  1984,  3.17  and  0.386  for  data  through  1988,  and  3.16  and 
0.307  for  data  through  1991. 

The  variability  of  predictions  using  the  three  different  distributions  of  a  are  compa¬ 
rable.  Table  3.5  shows  the  approximate  95%  prediction  intervals  for  hypothetical  games 
taking  place  at  the  beginning  of  the  1985,  1989  and  1992  seasons,  respectively,  using  the 
three  posterior  distributions  of  a  shown  in  Figure  3.6.  The  teams  are  those  involved  in 
the  first  20  games  of  the  1992  season.  We  computed  the  approximate  prediction  intervals 
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Widths  of  Approximate 
95%  Prediction  Intervals 


1992 

Games 

Data  through 
1984 

Data  through 
1988 

Data  through 
1991 

PHA 

vs. 

NO 

54.63 

57.11 

54.09 

BUF 

vs. 

LAN 

56.54 

57.36 

55.93 

CHI 

vs. 

DET 

53.52 

54.95 

56.62 

ATL 

vs. 

NYJ 

55.91 

55.26 

53.64 

HOU 

vs. 

PIT 

55.76 

55.40 

55.69 

MIN 

vs. 

GB 

55.40 

54.72 

55.10 

CLE 

vs. 

IND 

55.67 

56.54 

55.58 

SF 

vs. 

NYG 

55.27 

55.99 

53.82 

KC 

vs. 

SD 

55.40 

56.46 

55.12 

SEA 

vs. 

CIN 

55.43 

56.15 

55.38 

TB 

vs. 

PHX 

55.64 

55.90 

55.33 

DEN 

vs. 

LAA 

54.82 

56.62 

56.47 

WAS 

vs. 

DAL 

55.53 

56.39 

55.00 

LAN 

vs. 

NE 

53.52 

56.16 

57.19 

WAS 

vs. 

ATL 

55.95 

56.34 

57.12 

DAL 

vs. 

NYG 

54.66 

55.62 

54.58 

NO 

vs. 

CHI 

54.65 

56.69 

56.24 

LAA 

vs. 

CIN 

55.88 

55.52 

57.46 

DET 

vs. 

MIN 

55.17 

55.99 

57.09 

KC 

vs. 

SEA 

54.70 

57.62 

54.18 

Table  3.5:  95%  Prediction  Interval  Widths  for  Three  Sets  of  Data 
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Values  of  o  (data  through  1984) 
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Values  of  <r  (data  through  1991) 

Figure  3.6:  Distribution  of  <r  over  three  sets  of  data 

by  generating  3000  random  draws  of  the  predictive  distribution  for  score  differences  for 
each  analysis,  and  computing  the  difference  between  the  2.5%  and  97.5%  quantiles  for 
each  game.  The  table  shows  that  the  prediction  intervals  do  not  vary  much  by  year.  This 
suggests  that  greater  precision  in  a  does  not  appear  to  lead  to  noticeably  greater  precision 
of  the  forecasts. 
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3.4  Updating  the  Posterior  Distribution  for  1982  NFL  Game  Results 

3.4  Updating  the  Posterior  Distribution  for  1992  NFL  Game 
Results 


An  important  aspect  of  the  model  is  the  ability  to  update  the  parameter  distributions 
as  new  data  is  acquired.  We  computed  parameter  updates  incorporating  the  first  four 
weeks  of  NFL  game  scores  (56  observations)  into  the  analysis  using  the  non-iterative 
techniques  of  Section  2.5.2.  The  distribution  of  a  changed  very  little;  the  mean  and 
standard  deviation  of  a  after  incorporating  the  four  weeks  of  data  from  1992  were  3.19 
and  0.306,  respectively.  The  mean  changed  by  0.03  from  the  end  of  the  1991  season, 
and  the  change  in  standard  deviation  was  negligible.  We  also  computed  the  posterior 
mean  and  standard  deviation  for  the  team  parameters  and  for  the  home  field  advantage, 
shown  in  Table  3.6.  Even  early  in  the  1992  season,  some  teams  appear  to  have  changed 
appreciably  as  measured  by  the  mean  of  #(1992\  This  can  be  seen  by  comparing  to 
Table  3.2.  For  example,  the  Redskins  dropped  2.7  points  in  mean  point  scoring  ability 
whereas  Buffalo  increased  by  about  the  same  amount.  The  Dallas  Cowboys,  the  eventual 
league  champions,  have  improved  from  16th  to  9th  best  after  just  four  weeks.  The  standard 
deviations  are  larger  than  those  in  Table  3.2  because  under  the  assumptions  of  the  model 
there  is  greater  uncertainty  in  the  team  parameters  towards  the  beginning  of  the  season. 
This  is  also  reflected  in  the  reported  approximate  95%  credible  intervals,  which  are  wider 
when  the  1992  results  from  the  beginning  of  the  season  are  incorporated  into  the  analysis. 
The  posterior  mean  of  the  home  field  advantage  parameter  changes  from  2.96  to  2.94. 


3.5  Discussion 


The  analysis  of  the  NFL  data  using  the  methodology  of  Chapter  2  appears  to  be  a  rea¬ 
sonable  and  flexible  approach  to  forecasting  NFL  game  outcomes.  For  the  1991  season, 
the  team  with  the  highest  expected  ability  as  measured  by  the  posterior  means  was  the 
winner  of  the  superbowl.  Furthermore,  the  model  seems  to  provide  honest  posterior  pre¬ 
diction  intervals,  the  14  point  posterior  predictive  standard  deviation  matching  the  result 
obtained  by  Stern  (1991)  in  an  analysis  of  NFL  scores. 

An  issue  which  may  affect  the  validity  of  the  results  is  the  non-normality  of  the  data. 
Because  the  scoring  system  in  football  enables  certain  scores  to  be  more  likely  to  occur 
than  others,  treating  score  differences  as  continuous  may  be  too  approximate.  A  more 
accurate  approach  might  attempt  to  model  the  response  as  a  discrete  variable. 

The  normal  model  for  football  scores  may  also  be  inappropriate  for  describing  the 
behavior  of  large  score  differences.  Because  the  object  of  the  game  is  merely  to  outscore 
an  opponent,  and  not  maximize  the  amount  by  which  a  team  outscores  another,  there  is 
no  strong  incentive  to  “blow  out”  an  opponent. 


Other  work  on  modeling  professional  football  scores  have  incorporated  a  variety  of 
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Parameter 

Mean 

Std  Dev 

95%  Credible  Interval 

San  Francisco  49ers 

10.60 

4.63 

(  1-52, 

19.92) 

Washington  Redskins 

8.65 

4.75 

(  -0.37, 

17.87) 

Buffalo  Bills 

8.20 

4.56 

(  -0-77, 

17.26) 

Kansas  City  Chiefs 

6.83 

4.66 

(  -2.21, 

15.89) 

New  Orleans  Saints 

6.58 

4.65 

(  -2.28, 

15.94) 

Philadelphia  Eagles 

6.57 

4.73 

(  -2.52, 

16.16) 

Houston  Oilers 

5 .88 

4.64 

(  -3.16, 

15.37) 

Miami  Dolphins 

2.77 

4.64 

(  -6.44, 

11.75) 

Dallas  Cowboys 

2.23 

4.76 

(  -7.18, 

11.67) 

New  York  Giants 

2.16 

4.74 

(  -7.18, 

11.54) 

Minnesota  Vikings 

1.40 

4.57 

(  -7.32, 

10.38) 

Denver  Broncos 

0.88 

4.65 

(  -8.06, 

10.13) 

Detroit  Lions 

■LMI 

4.65 

(  -8-H, 

9.85) 

Chicago  Bears 

0.59 

4.57 

(  -8.31, 

9.49) 

Atlanta  Falcons 

0.29 

4.56 

(  -8.27, 

9.20) 

Pittsburgh  Steelers 

0.19 

4.63 

(  -9.08, 

9.43) 

Los  Angeles  Raiders 

-0.93 

4.65 

(-10.31, 

8.40) 

Seattle  Seahawks 

-2.31 

4.64 

(-11.28, 

7.08) 

Cleveland  Browns 

-3.58 

4.63 

(-12.44, 

5.63) 

New  York  Jets 

-4.29 

4.63 

(-13.24, 

4.95) 

San  Diego  Chargers 

-4.90 

4.65 

(-14.25, 

4.41) 

Cincinnati  Bengals 

-4.96 

4.64 

(-13.93, 

4.22) 

Green  Bay  Packers 

-5.00 

4.56 

(-14.01, 

3.88) 

Tampa  Bay  Buccaneers 

-5.24 

4.57 

(-13.98, 

3.67) 

Los  Angeles  Rams 

-5.25 

4.64 

(-13.94, 

3.81) 

Phoenix  Cardinals 

-7.86 

4.74 

(-17.06, 

1.20) 

Indianapolis  Colts 

-9.47 

4.65 

(-18.28, 

-0.45) 

New  England  Patriots 

-10.84 

4.73 

(-20.04, 

-1.48) 

Home  Field  Advantage 

2.94 

(  2.38, 

3.49) 

Table  3.6:  Distribution  of  0<1992)  after  4  weeks  of  1992  NFL  Game  Scores 
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approaches.  Harville  (1980)  constructed  linear  models  for  football  scores  where  the  pa¬ 
rameters  follow  an  autoregressive  process,  and  the  parameters  are  estimated  by  restricted 
maximum  likelihood  estimation.  Sail  as  and  Harville  (1988)  propose  a  Bayesian  dynamic 
linear  model  for  football  scores,  but  unlike  our  model,  they  estimate  the  system  variance 
parameter  a  priori.  Some  least  squares  approaches  to  predicting  football  outcomes  are 
developed  in  Stefani  (1977)  and  Stefani  (1980),  while  maximum  likelihood  methods  for 
rating  football  teams  are  proposed  in  Thompson  (1975).  Stern  (1992a)  compares  some  of 
these  different  methods  for  rating  NFL  football  teams  and  find  that  the  Bayesian  dynamic 
model  (with  <r2  fixed)  performs  better  than  single  season  or  non-dynamic  approaches. 


Chapter  4 


Paired  Comparison  Models  with 
Indicator  Outcomes 


In  Chapter  2,  we  considered  a  series  of  models  for  paired  comparison  experiments  where 
the  response  was  the  final  score  difference  of  a  competition.  This  chapter  treats  the  paired 
comparison  setting  in  which  the  only  available  information  on  the  result  of  a  comparison 
is  the  identity  of  the  winning  team  or  preferred  object.  The  response  variables  for  the 
models  considered  here  are  binary  indicator  variables :  ither  than  continuous  variables.  In 
the  following  sections,  we  examine  the  Bradley-Terry  paired  comparison  model  along  with 
a  Bayesian  extension,  and  demonstrate  the  construction  of  a  dynamic  paired  comparison 
model  that  allows  parameters  to  change  over  time.  The  methodology  in  this  chapter  is 
similar  to  that  of  Chapter  2.  We  show  how  to  analyze  the  data  under  the  model  using 
both  an  iterative  simulation  approach  and  a  non-iterative  method.  We  also  consider 
an  extension  of  the  model  to  accommodate  paired  comparison  experiments  in  which  the 
outcome  may  be  a  tie  or  an  indication  of  no  preference.  In  such  cases,  the  response 
variable  is  trinary  rather  than  binary. 


4.1  The  Bradley-Terry  Model 


The  paired  comparison  model  that  we  consider  as  the  basis  for  the  development  of  the 
methodology  in  this  chapter  is  the  Bradley-Terry  model(Bradley  and  Teny  1952).  Al¬ 
though  the  model  appeared  prior  to  their  1952  paper  (see,  for  example,  Zermelo  1929), 
the  model  is  connected  with  their  names  due  to  the  scope  of  their  work  on  the  subject. 

Suppose  p  teams  engage  in  competition.  Assume  that  each  team  *  has  an  associated 
parameter  jt,-.  This  parameter,  the  so-called  worth  parameter,  can  be  interpreted  as  the 
relative  ability  or  strength  of  the  competing  team.  In  conventional  paired  comparison 
settings,  n,  is  interpreted  as  the  “merit”  of  the  t-th  object  that  is  to  be  judged.  The 
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Bradley-Terry  model  represents  the  probability  one  team  defeats  another  as 

Pij  =  Pr(i  defeats  j)  =  (4.1) 

-f  » j 

Note  that  pij  +  Pji  =  1  so  that  the  model  as  stated  applies  only  to  paired  comparisons 
without  ties.  Also,  6ince  the  values  of  t,  are  unique  only  up  to  a  multiplicative  constant 
in  determining  the  win  probabilities,  a  constraint  is  usually  imposed  on  the  x.’s,  typically 
=  1.  The  simple  formulation  of  the  Bradley-Terry  model  has  made  it  one  of  the 
most  popular  paired  comparison  models. 

The  Bradley-Terry  model  can  be  derived  in  a  number  of  ways.  One  derivation,  useful 
for  showing  the  connection  to  the  models  in  Chapter  2,  assumes  that  when  team  t  com¬ 
petes,  it  produces  an  unobserved  score  Sj  independently  of  the  opposing  team  with  the 
cumulative  distribution  function 


Si  ~  Fi(s)  =  expC-e-*'-10*’*)). 


Thus  team  i  draws  a  random  score  from  an  extreme- value  distribution  (G  umbel  1961) 
with  location  parameter  log  x,.  It  directly  follows  that  the  distribution  of  the  difference 
Si  —  Sj  follows  a  logistic  distribution  with  mean  logx,  —  log*,-,  that  is, 

5,  —  Sj  ~  Ftj(a )  -  j  +  e-(.-(io**i-log*>)) » 

which  in  turn  implies 


Pr(S,  >  Sj)  =  Pr(S,  -  Sj  >  0)  =  1  - 


1 

1  +  glogWj-log^ 


*v 

+  Tj 


as  in  (4.1).  Alternative  motivations  for  the  Bradley-Terry  model  include  Yellott  (1977), 
who  derives  the  Bradley-Terry  model  from  Luce’s  Choice  Axiom  (Luce  1959),  and  Henery 
(1986)  and  Joe  (1988),  who  provide  a  maximum  entropy  derivation  of  the  Bradley-Terry 
model. 


A  comparison  can  be  made  here  between  the  above  derivation  of  the  Bradley-Terry 
model  and  the  normal  model  in  of  Chapter  2.  The  normal  model  postulates  that  score 
differences  follow  a  normal  distribution,  while  the  Bradley-Terry  model  postulates  that 
latent  score  differences  follow  a  standard  logistic  distribution.  One  feature  of  our  ver¬ 
sion  of  the  Bradley-Terry  model  is  the  lack  of  a  corresponding  parameter  to  r2  in  the 
normal  model.  This  is  not  a  critical  issue  because  the  Bradley-Terry  model  assumes 
that  scores  are  not  observed,  so  that  any  observation  variance  parameter  would  not  be 
identifiable  simultaneously  with  the  worth  parameters.  Many  paired  comparison  mod¬ 
els  allow  for  a  latent  observation  variance  parameter,  and  possibly  different  variances  for 
different  objects.  The  Bradley-Terry  model,  for  example,  can  be  extended  to  allow  for 
separate  variances  for  each  object.  Another  popular  model  is  the  Thurstone-Mosteller 
model  (Mosteller  1951;  Thurstone  1927)  which  assumes  that  preference  probabilities  can 
be  specified  as  pt;  =  $(—(7,-  -  lj)l +  erj),  where  #(•)  is  the  cumulative  distribution 
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function  of  a  standard  normal  random  variable.  Stern  (1992b)  has  shown  that  tbe  choice 
among  popular  paired  comparison  models,  including  the  Bradley- Terry  model  and  the 
Thurstone-Mosteller  model,  is  somewhat  arbitrary  in  the  sense  that  the  models  tend  to 
fit  paired  comparison  data  equally  well  or  poorly. 


In  analyzing  paired  comparison  data  under  the  model,  suppose  that  teams  t  and 
j  compete  ntJ  times,  with  team  i  winning  y,3  times  and  losing  nt]  —  ytJ  =  y]X  times. 
Then,  letting  it  -  (xj,  ...,  xp)  be  the  Bradley-Terry  worth  parameters,  the  distribution  of 
V  =  ( Vij .  i,j=  l,2,...,p)is 


f(y\ ») 


li  )v» 


The  Likelihood  for  it  can  be  expressed  as 


Uk<’,w  “ 


where  y,  =  Vij  >  the  total  number  of  wins  for  team  ». 


Maximum  likelihood  estimates  of  x,  can  be  obtained  by  the  Newton-Raphson  algo¬ 
rithm  in  the  usual  way,  incorporating  the  restriction  on  the  sum  of  the  x,’s.  Ford  (1957) 
shows  that  the  maximum  likelihood  estimate  is  unique  as  long  as  in  every  partition  of 
teams  into  two  non-empty  subsets,  some  team  in  the  second  has  beaten  some  team  in 
the  first.  This  condition  implies  that  maximum  likelihood  estimates  cannot  be  found  if 
any  team  went  undefeated,  or  if  any  team  lost  every  game.  It  follows  from  the  likeli¬ 
hood  that,  given  nt],  the  y,  are  sufficient  statistics  for  the  x,,  so  that  only  teams’  total 
scores  are  needed  to  perform  the  estimation.  Buhlmann  and  Huber  (1963)  show  that  the 
Bradley-Terry  model  is  the  only  linear  paired  comparison  model  for  which  this  is  true. 


4.2  The  Bayesian  Formulation 


The  Bayesian  model  we  adopt  here  is  due  to  Leonard  (1977).  For  a  p-team  population, 
we  assume  that  the  probability  team  i  defeats  team  j  is  given  by  (4.1).  We  impose  a 
multivariate  normal  prior  distribution  on  the  log  x;.  Notice  that  it  is  not  appropriate  to 
consider  a  multivariate  normal  distribution  on  the  it,  because  x,-  ranges  only  over  values 
greater  than  zero,  Reparametrizing  the  model  in  terms  of  7,-  =  logx,,  we  have 


Pi j  = 


+ 


e1' 

e'r-  +  e1*  ’ 


where  7,  is  the  i-th  team’s  “rating,”  we  can  assume  a  multivariate  normal  prior  on  7  = 
(71, ...,7m)  with  mean  vector  (i  =  (pj,...,pm)  and  non-singular  covariance  matrix  C. 
Naturally,  under  these  assumptions,  we  release  the  constraint  that  it,  =  1. 
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If  we  view  7,  as  representing  the  relative  playing  strength  of  team  » ,  imposing  a  mul¬ 
tivariate  normal  distribution  on  7  has  a  nice  interpretation.  Recalling  that  the  Bradley- 
Terry  model  can  be  derived  from  the  assumption  that  team  scores  are  drawn  from  extreme- 
value  distributions  with  location  parameters  log  the  multivariate  normal  prior  distri¬ 
bution  on  7  gives  information  on  the  location  of  each  individual  team’s  performance 
distribution.  The  mean  component,  /i,-,  of  the  prior  distribution  represents  the  location  of 
the  distribution  describing,  on  average,  team  i’s  performances.  The  larger  the  value  of 
the  greater  ability  of  the  team.  The  diagonal  elements,  c,i,  of  the  covariance  matrix,  C, 
indicate  the  variability  about  the  means  m  of  teams’  rating  distributions.  The  greater  the 
variance  of  7 i,  the  less  certainty  exists  about  a  team’s  ability.  The  off-diagonal  elements, 
ct] ,  are  the  covariances  between  7,  and  7 j.  When  c,;  approaches  zero,  the  location  pa¬ 
rameters  7,  and  7 j  vary  independently  of  one  another,  so  that  locations  of  ratings  of  two 
competing  teams  are  independent.  When  the  correlation  between  7,  and  y3  approaches  1, 
then  (7i,7j)  has  a  degenerate  bivariate  normal  distribution.  This  suggests  that  once  we 
know  the  location  of  team  i’s  performance  distribution,  we  then  know  exactly  the  location 
of  team  j's  performance  distribution.  A  perfect  correlation  and  equal  variances  implies 
that  the  difference  7,  —  7 j  is  constant. 


Given  a  multivariate  normal  prior  distribution  on  the  parameters  7,  and  suppressing 
the  conditioning  on  the  prior  parameters,  we  have 

/(7)  oc  exp(-|(7  -  m)TC~1(t  -  m))- 


The  Bayesian  analysis  proceeds  by  computing  a  posterior  distribution  for  7  given  the 
tournament  results,  y.  The  likelihood  after  reparametrizing  is  given  by 


Lik(7|y)  a 


y-r  (e'>'  )V'J  (e'Y*  )**• 
"  (e'*r*  +  e'b  )n*j 


(4.2) 


The  posterior  distribution  for  7  is  proportional  to  the  product  of  the  prior  and  the  likeli¬ 
hood,  which  can  be  written  as 


f(y\y)  « 

oc 


/(7)Lik(7|p) 


exp(-i(7~M)TC-1(7-M)) 


(tj  (e'r‘  )v'3  (e'yj  )V}'  \ 
\<<j  (e'n  +  C',J  J 


Leonard  (1977)  suggests  approximating  the  posterior  distribution  by  a  multivariate  normal 
distribution,  thereby  making  the  parameter  updating  nearly  conjugate.  To  do  this,  the 
posterior  mode  is  obtained  by  performing  the  Newton-Raphson  algorithm  on  the  log- 
posterior  distribution  for  7.  The  posterior  covariance  matrix  is  estimated  by  the  inverse 
of  the  non-singular  Hessian  matrix  used  in  the  last  iteration. 

Alternatively,  if  the  likelihood  is  approximately  normal,  the  posterior  distribution  can 
be  approximated  in  the  following  manner.  The  maximum  likelihood  estimate,  7,  and 
estimated  singular  information  matri;: .  R,  of  7  from  the  likelihood  in  (4.2)  can  be  found 
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using  the  Newton-Raphson  algorithm.  We  can  then  approximate  the  distribution  of  7  as 
a  multivariate  normal  distribution  with  mean  7  and  singular  precision  matrix  R.  Then 
the  posterior  density  for  7  is  the  product  of  a  normal  prior  density  and  an  approximately 
normal  likelihood,  and  therefore  results  in  an  approximately  normal  posterior  distribution. 
This  method  for  approximating  the  posterior  by  a  multivariate  normal  is  equivalent  to 
first  expanding  the  log-likelihood  in  a  Taylor  Series  around  7,  dropping  the  cubic  and 
higher  order  terms,  and  combining  the  remaining  expression  with  the  prior  distribution 
to  form  a  multivariate  normal  density.  The  approximate  posterior  distribution  for  7  is 
attained  in  the  usual  manner, 


/( 7l»)  «  exp(-j(7  -  rfC  -*(7  -  #»')) 

where 

*i'  =  (c^+tfrHcrV  +  tfT) 
c'  =  (cr1  +  R)-1 

are  the  posterior  parameters.  As  can  be  seen  from  the  above  equation,  the  posterior  mean 
can  be  interpreted  as  a  weighted  average  of  the  prior  mean  and  the  estimated  mean  from 
the  data.  This  posterior  distribution  can  then  be  used  as  the  prior  distribution  when 
additional  data  becomes  available. 

The  implementation  of  this  model  is  straightforward.  Before  a  tournament  or  season  of 
competitions  begins,  a  normally  distributed  prior  distribution  is  assumed  on  the  ratings. 
The  tournament  is  played,  and  then  using  one  of  the  procedures  described,  we  compute 
posterior  estimates  of  the  mean  and  covariance  matrix  for  ratings  and  use  these  updated 
parameters  as  the  prior  parameters  for  the  next  tournament.  If  a  team  has  never  competed 
in  tournaments,  the  model  allows  for  the  assignment  of  a  large  prior  variance  to  describe 
the  uncertainty  in  the  location  of  the  team’s  performance  distribution. 

Other  Bayesian  formulations  of  the  Bradley- Terry  model  have  been  suggested.  The 
first  Bayesian  approach  was  developed  by  Davidson  and  Solomon  (1973).  They  introduce 
a  conjugate  prior  for  the  Bradley- Terry  likelihood,  and  the  worth  parameters  can  then 
be  estimated  by  finding  the  mode  of  the  posterior  distribution.  Chen  and  Smith  (1984) 
consider  a  Dirichlet  prior  on  the  worth  parameters.  For  their  model,  estimates  are  found 
by  taking  weighted  averages  of  posterior  means. 


4.3  A  Dynamic  Bradley- Terry  Model 


We  consider  in  this  section  a  dynamic  extension  to  the  Bradley-Terry  model  analogous  to 
the  dynamic  extension  of  the  normal  model  in  Section  2.3.  The  model  in  Section  4.2  is 
not  appropriate  for  the  situation  in  which  teams’  abilities  may  change  over  time  because 
it  treats  the  data  as  time  exchangeable. 
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Let  yM  =  i,j  =  1,2,  ...,p,»  /  j)  be  the  matrix  of  preference  outcomes  where 

yjj)  is  the  number  of  times  t  defeats  j  in  tournament  t,  with  t  —  1, . . .  ,T.  We  assume  the 
Bradley-Terry  model  as  before,  with 

—  Pr(i  defeats  j  in  tournament  t)  = - — - j-r-.  (4.3) 

exp(7-  ')  +  exp(7]  ') 

We  now  assume  a  probability  model  for  team  ratings  over  time  with 

y(t)  _  .j.  V(t)  (4.4) 

where  i/M  is  the  amount  team  abilities  change  from  tournament  t  -  1  to  t.  As  in  the 
normal  model,  we  assume  that  vW  is  stochastically  independent  of  though  more 

general  formulations  are  possible.  We  assume  that,  in  general, 

(i/‘>|<r2)  ~  N(at,craI,,),  (4.5) 

where  at  is  the  mean  amount  by  which  teams  change  between  tournaments  t  —  1  and  t, 
and  a2  is  the  variance  of  the  innovation  that  occurs  between  tournaments.  We  also  specify 
an  initial  prior  distribution  on  V1), 

7<1)~N(/r(1),C<1>) 

and  a  prior  distribution  obu=  1/c2, 

/(w|ao,  6o»  Do)  ot  w00_1e~6°to'  (4.6) 

where  the  hyperparameters  ^l\C^,ao  and  bo  axe  specified  in  advance,  and  can  be  chosen 
to  represent  prior  beliefs. 

The  model  in  (4.3)-(4.6)  show  unmistakable  similarity  to  the  dynamic  normal  model 
of  (2.8)-(2.12).  The  Bradley-Terry  model  in  (4.3)  models  the  outcomes  of  comparisons  at 
a  specific  point  in  time.  The  system  equation  of  the  dynamic  model  in  (4.4),  shows  how 
the  Bradley-Terry  ratings  change  over  time.  The  parameter  at  *n  (4.5),  which  may  be  a 
known  function  of  time  or  an  estimable  parameter,  indicates  how  teams’  abilities  evolve 
over  time.  As  in  the  normal  dynamic  model  of  Chapter  2,  a(  may  incorporate  information 
about  factors  that  might  influence  the  outcome  of  comparisons,  like  age,  preparation,  and 
so  on.  However,  as  in  Chapter  2,  we  set  a*  =  0  for  all  t  in  developing  the  dynamic  model. 
This  defines  a  random  walk  model  for  the  parameter  7W. 


4.4  Non-iterative  Analysis  of  the  Dynamic  Bradley-Terry 
Model 


We  consider  in  this  and  the  next  section  two  possible  methods  for  analyzing  data  under 
the  dynamic  Bradley-Terry  model.  These  two  sections  parallel  Sections  2.5.1  and  2.5.2 
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in  describing  a  non-iterative  approach  and  a  Gibbs  sampler  approach.  The  analyses  are 
made  more  complex  because  the  likelihood  for  each  paired  comparison  experiment  is  no 
longer  a  normal  density. 

This  section  demonstrates  the  methods  used  to  analyze  data  under  the  dynamic 
Bradley-Terry  model  using  a  non-iterative  approach.  As  with  the  normal  dynamic  model, 
the  analysis  of  the  joint  posterior  distribution  of  parameters  is  intractable  analytically, 
so  we  resort  to  similar  techniques  employed  in  Section  2.5.2.  The  non-iterative  approach 
involves  approximating  the  distribution  of  u;  by  a  discrete  distribution.  Furthermore, 
we  make  use  of  the  normal  approximation  to  the  Bradley-Terry  likelihoods  at  time  t,  as 
described  in  Section  4.2,  in  order  to  facilitate  the  analysis. 

To  motivate  the  non-iterative  approach,  we  assume  the  full  dynamic  Bradley-Terry 
model  described  in  Section  4.3.  We  are  primarily  interested  in  the  marginal  posterior 
distribution  of  Dr).  The  distribution  can  be  expressed  as 

/(7(T)|f?T)  =  J  f('y{T)\u,DT)  f{u\DT)du-  (4.7) 

The  first  density  in  the  integrand  can  be  approximated  analytically  using  the  methodology 
described  in  Section  4.4.1.  The  second  density,  however,  has  no  convenient  closed-form 
representation.  We  consider  an  approach  in  Section  4.4.2  that  provides  a  discrete  ap¬ 
proximation  to  f{u\DT).  In  Section  4.4.3  we  consider  approximations  to  the  integral  in 
(4.7)  in  order  to  examine  the  posterior  distribution  of  the  7^.  Techniques  for  updating 
the  distributions  of  the  parameters  to  include  new  data  at  time  T  +  1  are  described  in 
Section  4.4.4. 


4.4.1  Analysis  conditional  on  v 

We  develop  in  this  section  a  methodology  that  allows  us  to  approximate  the  posterior 
distribution  of  7^  conditional  on  u  by  a  multivariate  normal  distribution.  Once  we  find 
the  distribution  of  w,  perhaps  using  the  techniques  of  Section  4.4.2,  we  can  then  obtain 
the  posterior  distribution  of  7^  marginal  over  u>.  The  analysis  here  is  similar  to  the 
analysis  of  Section  2.4  for  normal  scores. 

The  full  posterior  posterior  distribution  of  all  parameters,  given  u>,  can  be  written  as 

/(7(1),-..,7(r)k,Z?j') 

a  {/(7(1)k,  Do)}  X  {/(di|7(1\u;,I>o)} 

x{/(7(2)|7(1),w,Z?x)}  X  {/(d2|7(1),7(2),w,0i)} 


X{/(7(T)|7(1)>---,7(T  X  {/(dr|7<1),...,7(r),w,I?r_i)}, 
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or  in  more  detail  as 

oc  exp(-^(7(1)  ~  M<1,)TC<1>  -1(7(1)  - 

3.  yj  exp(%'1)y«j))exp(Tj1)vjl)) 
X*1i(exp(7t0))  +  exp(7j1)))n-) 

T  f 

X  JI  <  exp(--(7(<)  -  7<‘_1))t(V‘)  -  7(<_1))) 
2 

x  TT  exP(^l<)y.'i))exP(T,j<)»il))  1 
i<j  (exp(7f })  +  exp(7j‘)))n^  J 


Normal  Approximation 

We  now  approximate  each  of  the  Bradley-Terry  likelihoods  by  a  multivariate  normal 
distribution  using  the  technique  of  Section  4.2.  For  tournament  t,  we  obtain  the  maximum 
likelihood  estimate,  7W,  and  the  estimated  information  matrix  of  the  parameter  7W, 

and  approximate  the  t-th  likelihood  by 

Lik(7(<)|dt)  oc  exp(-i(7<f>  -  7<‘>)TR(‘>(7(t)  _  ^(0)). 

We  therefore  approximate  the  full  posterior  distribution  of  all  parameters  by 
/(7(1),..-,7(T)|w,£t) 

oc  exp(-i(*Y^  —  m(1>)TCW  —  n^)) 

X  exp(-^(7^^  -  -yl1))7 -  7M)) 

a t 

T 

X  JI  {exp(-^(7(t)  -  7(,_1))T(t(‘)  -  7<,_1>)) 

t=2  *“  £ 

X  exp(-i(7{<)  -  7<t))Til(t)(-j,(‘)  _  7(*1))| . 

Thus  the  posterior  density  is  approximated  by  the  product  of  normal  densities,  and  is 
itself,  therefore,  an  approximately  normal  density.  Note  that  the  information  matrices, 
JRW  ,  are  singular  due  to  the  constraint  used  in  defining  the  maximum  likelihood  estimates. 
This  does  not  present  any  problems  here. 

Updating  and  Forecasting  Recursions 

To  obtain  the  marginal  distribution  of  (7^1|u;,  Dj)  conditional  on  u,  we  demonstrate  a 
recursion  analogous  to  that  of  Section  2.4  for  the  normal  model.  This  is  the  Bayesian  cal- 
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cula.tion  that  updates  the  distribution  of  (7^|w,  A-i)  10  the  distribution  of  (7(,*|w,  A)> 
and  then,  via  a  forecast  step,  obtains  the  distribution  of  A)- 

Suppose  before  observing  tournament  t  the  prior  distribution  on  7(<)  is 

(7<‘>| w,A-i)~N(/*<*>,C<‘>). 

We  now  observe  tournament  outcomes,  dt,  and  update  the  distribution  of  as  in  Sec¬ 
tion  4.2  to  obtain 

(7(‘>|u;,A)~ 

where 

M'<‘)  =  (C(t)  _1  +  J2(t))_1(C(t)  -1/i(f)  +  /Z(,)7(<)) 

C'f*)  = 

These  are  the  updating  equations  to  modify  the  distribution  of  parameters  with  observed 
data.  Now  time  passes  between  tournaments  t  and  t  +  1,  and  to  incorporate  the  extra 
uncertainty  due  to  the  passage  of  time  we  have 

(7(‘+%(‘U,A)~N(7(t),-Ip). 

1 0 

The  joint  posterior  distribution  of  7^+1*  and  7W  is  therefore 

/(7(*+1),7(t)K  A) 

=  /( 7(,)k,  A)  X  /(7(‘+1)l7(‘U,  Dt) 

<x  exp(-i(7(l)  -  fx',,l)TC'«>  -  ‘(-,<‘1  -  p*1)) 

*  «P<-^(V'+,>  -  -T'*»)T<-r('+*>  -  -r1")). 

Marginalizing  over  7W  yields 

/(7(t+1)k,A) 

=  //(7(t+1),7(t)iw,A)d7(t) 

«  exp(-i(7(‘+1)  -  ,i<‘+1>)TC<‘+1>  -»(7,t+1)  -  M(t+1))), 

where 


C«+D  _  c'(t)  +  ^Ip.  (4.8) 


These  are  the  forecasting  equations  to  account  for  the  extra  uncertainty  in  the  parameters 
due  to  the  passage  of  time.  The  interpretation  of  (4.8)  is  that  the  mean  rating  of  players 
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remains  constant  over  time  under  the  random  walk  assumption  for  the  innovations,  bet 
the  variances  increase  by  a 2  =  1/w.  We  now  have  the  approximate  desired  distribution. 

(7(t+1)|w,£>t)~  N(m(‘+1>,C(‘+1>). 

This  defines  one  complete  iteration  of  the  updating  and  forecasting  recursion  for  the 
dynamic  Bradley-Terry  model  conditional  on  u>. 

Predictive  Distribution  of  dt+\  given  u> 

Predictive  distributions  for  the  preference  probability  p{J+1*  at  time  T  + 1  given  u>  can 
be  obtained  by  marginalizing  over  7*7+1X  From  the  Bayesian  updating  and  forecasting 
recursions,  we  have  the  distribution 

(7^t+1)|w,  ~  N(/i^r+1\ C^T+1^). 

We  also  have  the  probability  t  is  preferred  to  j  at  time  T  +  1  conditional  on  7,-T+1*  and 
7 jT+1)  given  the  available  data, 


J>!j+1)(7(T+1)^,JDT)  = 


exp(7,(T+1)) 


exp(7r+,))  +  «p(7j"+10 

Then  the  expected  probability  marginalized  over  7<T+l)  is  given  by 

p<?+1)(u,DT) 

=  J P.-J+1)(7(T+1).w,I>r)  /(7(T+1>|w,  Dt)  d7(T+l) 
exP(7,-T'l~1)) 


IT+IU 


-  s 


exp(7,-T+1))  +  exp(7jT+1)) 


^(^T+Di^T+iJ  ^T+x))  d7(T+ 1)7 


where  ^(7(T+1)j/i(T+1),C^T+1^)  is  the  multivariate  normal  density  function  with  mean 
/i(T+1)  and  variance  C*T+1X  The  first  factor  of  the  integrand  only  depends  on  7,-T+1'  and 
jjT+1\  so  that  the  integral  reduces  to 


(T+l) 


•J 


(w,  Dt)  =  J 


exp(7,-r'l'1)) 


exp(7  ,-r+1))  +  exp(7^  **') 


(T+ih 


Wfa)  |M(0)  ’C/(0) 


<T+i>)d7r+i)d7f+i), 

(4.9) 


where  7^*^  =  (7,-T+1^7 jT+1^)T,  that  is,  the  subset  vector  of  7<T+1)  that  contains  com¬ 
ponent!  i  and  j.  The  parameters  and  are  the  corresponding  mean  vector 

and  variance  matrix  of  7^*1\  The  first  factor  of  the  integrand  can  be  expressed  as 

1/(1  +  exp(-(7,-r+1^  —  1jT+1^))),  and  then  a  change  of  variables  allows  us  to  express  the 
integral  as 

Dt)  =  XI  l  (n-«X(-))  V(“^’ C“>  *'• 
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where 


«  =  7|<r+1>  -  7<T+1) 

=  E(7r+i)>iiT+i)) 

Cu  =  Var(7<T+1)  -  7f+1)). 

While  no  closed-form  solution  exists,  numerical  methods  can  be  employed  to  calculate 
this  integral.  (Telman  and  King  (1990)  obtain  the  same  integral  in  a  different  context, 
and  they  approximate  — by  a  third  degree  polynomial  to  obtain  an  approximate 
value  of  the  integral.  An  alternative  approach  is  to  perform  Monte  Carlo  integration  by 
drawing  a  large  random  sample  of  values  tt*  from  the  normal  distribution  with  mean 
and  variance  Ctt,  and  using  the  sample  mean  of  1/2(1  4-  exp(— u*))  as  an  estimate  of 


4.4.2  Analysis  to  obtain  the  distribution  of  (u\Dt) 


In  Section  2.5.2,  we  introduced  the  method  of  approximating  the  prior  distribution  on  u>  by 
a  discrete  distribution,  and  showed  how  to  obtain  an  approximate  posterior  distribution 
on  (w|Z?7-).  Here  we  apply  this  method  to  the  analysis  of  the  dynamic  Bradley- Terry 
model.  As  in  the  normal  model,  we  assume  here  that  we  have  obtained  a  sample  from  the 
prior  distribution  on  u,  or  that  we  have  selected  overdispert  1  values  for  u  and  chosen 
probabilities  over  these  values  to  reflect  our  prior  beliefs. 


Suppose,  in  particular,  we  have  selected  a  set  of  m  values  of  u,  M, . .  .,wm},  and 
assume  we  have  approximated  the  prior  distribution  by  the  probability  function 


/(« I  A)  = 


if  U)  =  W; 

if  <*>  i 


so  that  /(w|  AO  has  mass  only  on  the  set  {u^, . .  .,wm}.  For  any  season  t,  we  have 
Bayes  theorem 


/MA) 


f(dt\Dt-i,u) 

/(4IA-0 


/MA-i). 


by 


Thus  the  marginal  distribution  of  u  given  data  through  season  t  has  a  simple  relationship 
to  the  distribution  of  u>  given  data  through  season  t  —  1.  Continuing  the  recursion,  we 
have  for  all  t 


/M  A)  = 


< x 


nu  MI^-1^) 

n>=i/Ki^-i) 


/MAO 


/MAofl/KPi-i,")- 
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Thus  the  marginal  density  of  (wl-Dr)  is  proportional  to  the  product  of  the  prior  density 
of  (u>|23o)  with  a  reweighting  factor,  which  is  the  product  of  conditional  probabilities  of 
the  form  A  single  term  in  this  product  can  be  written  as 

/(d,|A-t,u>)  =  j  (4.10) 

where  f{dt\y^\u)  is  the  product  of  Bradley-Terry  probabilities  at  tournament  t  given  the 
parameters  -yM,  and  is  the  multivariate  normal  density  that  is  obtained 

from  the  updating  and  forecasting  recursions  of  Section  4.4.1.  The  integral  in  (4.10)  cannot 
be  evaluated  in  closed  form,  so  we  use  Laplace’s  method  (see,  for  example,  Thisted  1988) 
to  obtain  an  approximate  evaluation  of  the  integral.  To  apply  Laplace’s  method,  we  find 
the  mode,  7*  =  7W,  of  f(dtly^,u>)  and  the  Hessian  matrix,  R^,  evaluated  at  the  mode. 
The  approximate  mode  of  the  entire  integrand  is  therefore  (R^  4-  -1  )~l(JR(t)-y(‘)  + 

C(t)  -1 /tW),  and  the  Hessian  matrix  of  the  integrand  is  R*  =  R<*>  +  C**)-1.  Laplace’s 
method  then  obtains 


f(dt\DT-i,u)  *  f(dt\y{t)  =  7*)/(tV°.  C^\u>Ury>'i\RTl/2-  (4.11) 

This  approximation  is  justified  by  performing  a  Taylor  series  expansion  of  the  logarithm 
of  the  integrand  around  its  mode,  and  dropping  the  cubic  and  higher  order  terms.  The 
method  works  best  when  the  log  of  the  integrand  is  well  approximated  by  a  quadratic 
form. 

The  approximate  posterior  distribution  of  u>  can  be  obtained  by  reweighting  the  prior 
distribution  by  a  product  of  terms  of  the  form  given  in  (4.11), 

(x)  „  ».""nT..Mk,-P,-.)  ,, ... 

'  ee.i40) 

In  summary,  to  calculate  the  approximate  posterior  distribution  of  w,  we  begin  by  choosing 
m  overdispersed  values  , . . . ,  um  of  u  and  set  corresponding  prior  probabilities  to  be 
7r{0*, . . . ,  Xm*.  Then  for  each  value  in  the  set  {u>i, . . .  ,u>m},  compute  the  approximate 
normal  parameters  associated  with  /(7^|u>i,  1),  for  all  t,  and  calculate  via  Laplace’s 

method  /(dt|wi, Dt-\)  as  given  by  (4.11).  The  posterior  updates,  trjT\  for  jrj°\  are  the 
reweightings  given  by  (4.12). 


4.4.3  Marginal  and  Forecasting  Distributions 

Inferences  on  7W,  or  predictions  of  future  outcomes,  can  be  obtained  by  integrating 
out  the  nuisance  parameter  u>.  We  discuss  in  this  section  methods  of  performing  such 
inferences  or  predictions. 

Marginal  Distribution  of  {y^\Dr) 
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From  the  analysis  of  the  dynamic  Bradley- Terry  model,  we  obtain  a  discretized  poste¬ 
rior  distribution  of  u.  Suppose  the  posterior  distribution  of  u>  has  positive  mass  only  on 
the  set  {u>i um }  and  that 


f(u\DT ) 


_  /  i 
l  0  i 


if  u>  =  u, 
ifw  i 


Moments  of  uj  or  a2  =  l/u>  can  be  calculated  relative  to  this  distribution. 

Making  inferences  on  the  distribution  of  ('y^l-Dr)  requires  marginalizing  over  u>.  We 
can  obtain  the  marginalized  posterior  density  of  y(^) 

H-rm\DT)  =  Jnym\u,DT)fMDT)Ju 

=  L'.'T'/(T,r*K,Cr), 

1=1 


which  is  a  mixture  of  approximately  normal  densities.  Inferences  on  can  be 

made,  therefore,  by  drawing  a  large  sample  from  the  mixture  of  normal  distributions 
and  using  the  sample  to  obtain  empirical  confidence  intervals.  A  single  draw  from  the 
distribution  of  is  obtained  by  first  drawing  u  from  the  discrete  distribution  of 

(u>(Z?r),  an<f  subsequently  drawing  yW  from  the  approximately  normal  distribution  of 
(VT)| u,Dt)-  This  process  is  repeated  until  the  sample  drawn  is  large  enough  to  make 
inferences  at  the  desired  level. 

Forecasts  of  Game  Outcomes 


A  predictive  estimate  of  the  probability  i  is  preferred  to  j  at  time  T  +  1,  given  data 
through  time  T,  is 


p<T+1> 

y*3 


(Dt) 


[  exP(7,-r+1)) 

J  exp(7jr+1,)  +  exp(7jr+1)) 

=  Jp^+i\u,DT)f(u\DT)du, 


(T+l) 

(V) 


Im 


(T+l) 

(0) 


C(T+1) 

’  So) 


)/(u>|Dr)«Mr+%f+,)du, 


where  Dt)  is  defined  in  (4.9),  and  computed  as  described  in  Section  4.4.1.  There¬ 

fore  the  integral  above  can  be  computed  as  a  weighted  average  of  p\J+1\u>,  Dt), 


m 

P%+l)(DT)  =  '£p<?+1\uJDT)4T+') 


t=i 


This  gives  a  posterior  estimate  of  the  preference  probability. 
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4.4.4  Sequential  Updating 

This  section  assumes  that  we  have  already  observed  data  through  time  T  and  have  ob¬ 
tained  both  the  conditional  posterior  distribution  of  (7^(0;,  Dt)  and  the  posterior  dis¬ 
tribution  of  (uj\Dt),  and  now  we  need  to  npdate  the  distributions  of  the  parameters  to 
incorporate  information  available  at  time  T  +  1. 


At  time  T,  we  have 

and  we  also  have  a  sample  of  m  values,  ,  wm,  for  which 


MDr)  = 


if  u  =  u >, 
ifw  ^ 


Using  the  methodology  of  Section  4.4.1,  we  can  update  the  distribution  of  (-y^T)|u;,  D?) 
employing  the  updating  and  forecasting  equations.  To  update  the  distribution  of  ( u\Dt ) 
to  the  distribution  of  (u>|Z?t+i),  we  apply  Bayes  rule  to  obtain 


f(uj\DT+i)  = 


ex 


f(dT+i\PT,v) 


f(u\DT) 


/(dr+il-Dr) 
f(dr+i\DT,u)  f{u\Dr). 


Laplace’s  method  gives 

/(dr+iU>r,uO  *  /(dr+i|7(r)  =  r)nY\^T\c^)(2xy>^\R-\-^, 

where  7*  and  Rm  are  defined  in  Section  4.4.2.  Therefore,  to  obtain  the  updated  distribu¬ 
tion  of  w  given  the  new  data  at  time  T  + 1,  we  reweight  the  distribution  of  (w|Pr)  by  the 
marginal  likelihood  /(dj+i|w,  Dt)  at  time  T  +  1, 

(T+i)  _  *,-r)/(dr+ikM-Pr) 

'  Z^1«'jT)/(rfr+1|wi,Dr)’ 

From  this  updated  distribution  of  w,  we  can  make  inferences  on  (7^r+1^|I?T-M)  or  compute 
forecast  estimates  using  the  methods  of  Section  4.4.3. 

4.5  Iterative  Simulation  Analysis  of  the  Dynamic  Bradley- 
Terry  Model 


A  drawback  of  the  non-iterative  analysis  described  in  Section  4.4  is  that  the  Bradley- 
Terry  likelihood  for  each  tournament  is  approximated  by  a  Gaussian  density.  It  is  this 
approximation  that  allows  the  marginal  posterior  distribution  of  player  parameters  to  be 
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obtained  in  closed-form  in  a  manner  analogous  to  the  normal  models  of  Chapter  2.  How 
ever,  if  the  tournaments  involve  a  small  number  of  games  then  the  normal  approximation 
may  not  be  satisfactory.  In  this  section,  we  consider  an  analysis  using  the  Gibbs  sampler 
to  draw  from  the  exact  joint  posterior  distribution  of  all  parameters.  A  description  of 
the  Gibbs  sampler  is  found  in  Section  2.5.1.  In  the  current  problem,  a  draw  from  each 
of  the  conditional  distributions  is  accomplished  via  a  Metropolis  step  (Metropolis  et  al. 
1953)  at  each  iteration  of  the  Gibbs  sampler.  The  Gibbs  Sampler  with  a  Metropolis  step 
is  described  in  Section  4.5.1.  We  then  describe  the  implementation  of  the  Gibbs  sampler 
for  the  dynamic  Bradley-Terry  model  in  Section  4.5.2.  In  Section  4.5.3  we  demonstrate 
methods  for  obtaining  an  approximate  marginal  posterior  distribution  of  and  pre¬ 
dictive  distributions  for  future  games,  as  well  as  methods  for  updating  the  parameters  to 
incorporate  new  data  using  the  technique  of  sequential  imputation  (Kong,  Liu  and  Wong 
1991). 


4.5.1  The  Gibbs  Sampler  in  Conjunction  with  the  Metropolis  Algorithm 

The  Gibbs  sampler,  as  described  in  Section  2.5.1,  draws  samples  iteratively  from  a  se¬ 
quence  of  conditional  distributions.  After  sufficiently  many  draws,  continued  application 
of  iterative  sampling  produces  samples  from  the  full  joint  distribution  of  the  parameters. 
In  some  cases,  even  though  the  conditional  densities  may  be  specified  exactly,  it  is  not  pos¬ 
sible  to  draw  a  sample  from  these  densities  directly.  The  Metropolis-Hastings  algorithm 
(Hastings  1970)  provides  a  remedy  to  this  problem. 

Suppose  we  would  like  to  draw  a  random  sample  from  a  distribution  F  with  density 
function  /(x),  and  we  know  how  to  draw  a  sample  from  a  distribution  G  with  density  g(x) 
both  defined  on  the  same  space  X.  The  Metropolis  algorithm  is  a  Monte  Carlo  Markov 
chain  method  that  sequentially  draws  values  Xj,X2, ...  in  a  manner  described  below  that 
eventually  produces  a  random  sample  from  F.  Denote  x0  6  X  an  arbitrary  initial  draw, 
and  let  Xi  be  the  i-th  draw  of  sequence.  To  obtain  the  (i  +  l)-th  draw  of  the  sequence,  we 
draw  a  trial  value  y  from  G ,  and  let 

f  y  w.p.  min(l,^gj) 

*+1  \  Xi  w.p.  i-min(l,^), 

where  w(-)  =  /(•)/?( •)>  the  importance  ratio.  Upon  convergence,  this  algorithm  will 
produce  a  random  sample  from  the  target  distribution  F  (Hastings  1970). 

The  Metropolis-Hastings  algorithm  can  be  incorporated  into  the  Gibbs  sampler  in  a 
variety  of  ways.  The  approach  we  consider  is  to  implement  one  Metropolis  step,  that 
is,  one  iteration  of  the  Metropolis-Hastings  algorithm,  at  each  draw  of  a  conditional  dis¬ 
tribution  in  the  Gibbs  sampler.  To  be  more  precise,  suppose  we  have  m  random  vari¬ 
ables  Zm,  and  we  are  able  to  specify  the  densities  of  the  conditional  distributions 

(ZijZj, . . .,  Z,_i, Z,+i, . . .,  Zm),  i  =  1, . .  .,m.  Without  loss  of  generality,  we  restrict  our 
attention  to  the  distribution  of  (Z\\Z2,  •  •  • ,  Zm).  At  iteration  q,  we  want  to  draw  from  the 
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exact  density 

Suppose  drawing  from  /(•)  is  difficult,  but  we  can  draw  z  from  g(Z\\Z^  t*  •  ^  )• 

Then  the  Metropolis  step  w;  * !  in  the  Gibbs  sampler  involves  assigning 


w.p.  min(l,w(zriv 
.p.  1  -  min(l, 


w 


where  w(-)  =  /(•)/<?(')•  I11  general,  one  Metropolis  step  is  performed  for  every  conditional 
distribution  in  the  Gibbs  sampler  for  which  drawing  a  sample  value  is  difficult.  It  is 
possible  to  perform  several  Metropolis  steps  when  sampling  from  a  particular  conditional 
distribution,  but  the  tradeoff  between  the  improved  convergence  properties  and  increases 
computational  costs  is  still  an  open  question. 


4.5.2  Gibbs  Sampler  Analysis  of  the  Dynamic  Bradley-Terry  Model 

The  goal  of  our  analysis  in  this  section  is  to  describe  a  method  of  drawing  a  random 
sample  of  parameter  values  from  the  joint  posterior  distribution  (7*1', . . . 

We  will  base  our  inferences  about  the  parameters  on  the  sample. 

Unlike  the  Gibbs  sampler  of  Section  2.5.1,  we  cannot  rely  on  the  conjugacy  of  the  likeli¬ 
hoods  and  the  prior  a  ttributions  to  simplify  our  analysis.  Consequently,  we  do  not  sample 
alternately  between  two  conditional  distributions,  but  rather  among  all  T  +  1  conditional 
distributions.  We  describe  now  the  sampling  strategy  for  alternately  drawing  from  the  dis¬ 
tributions  (7^|7^_t*,w,  Dt),  for  t  =  l,...,r,  where  7('f)  =  (7**\  •  7*l+1\- • -iT(T*), 

and  (o;|7(1),---,7(T)^r)- 

Steps  of  the  Gibbs  Sampler 

The  Gibbs  sampler  proceeds  as  follows: 


1.  Pick  a  starting  value,  u>c,  for  the  system  precision. 

2.  For  t  —  1, . . .,T, 

a.  Let  7^  be  the  current  value  of  7W. 

b.  Find  the  conditional  density  (up  to  a  scalar  constant),  /,  of  7W  given  the  re¬ 

maining  parameters. 

c.  Compute  the  approximate  mean,  Mt,  and  approximate  variance,  Vt,  of  the  con¬ 

ditional  distribution. 

d.  Draw  a  candidate  value  7*  from  a  normal  distribution  with  mean  Mt  and  variance 

Vt.  Let  the  normal  density  be  given  by  g. 
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3. 


e.  Accept  7’  as  the  new  current  value  7^  with  probability  min(l, 
7^  as  the  current  value  otherwise,  where  w(-)  =  f(-)/g{  ). 
Draw  ue  from 


M-ri") 


),  and  keep 


(w|7i1),...,7iT),l?r)  ~  Gamma<ao+p(r-l)/2,6o+^5Z(7i‘)-7i<  1))T(7i‘)-7i‘  1)))- 


t=2 


Repeat  steps  (2)  and  (3)  until  convergence. 

Sampling  from  (7(t)|7(-<),u;, Dt) 

This  section  describes  step  2  of  Gibbs  sampler  procedure  outlined  above.  To  draw  from 
the  conditional  distributions,  we  first  note  that  the  joint  distribution  of  all  parameters 
can  be  expressed  as 

/(7(1),...,7(T),w|Z)r) 

«  if(u\D0)}  X  {/(7(1V,A>)}  X  {/(<*,  |7(1U,A>)} 
x{/(7(2)|7(1),^,  D1)}  X  {/(</2|7(,),7(2),^  A)} 


x{/(7(T)|7(1),---,7(T  t)}  X  {f(dT\-f{1) , . .  DT-i)}, 

(4.13) 

where  the  conditional  densities  of  7W  are  normal  densities,  and  the  likelihoods  are  prod¬ 
ucts  of  Bradley-Terry  probabilities. 

The  conditional  posterior  distribution  for  7W  given  the  rest  of  the  parameters  has  a 
density  proportional  to  that  in  (4.13).  Let  uc  and  7!”^  be  the  current  draws  of  and 
7*~{),  and  let  7^  be  the  most  recent  draw  of  7W.  We  consider  the  case  where  1  <  t  <  T. 
Neglecting  terms  constant  with  respect  to  7W  yields 

/(7(t)l7i_<),^c,Z?T) 

«  {/(7(<)l7?"1)»«e,  A-i)}  X  {f{dt\yW,uc,Dt-i)}  x  {/(7l‘+1)l7(t).«e» Dt)}. 

(4.14) 

The  conditional  distribution  of  7W  is  proportional  to  the  product  of  a  normal  density 
with  mean  7c*-1^  and  variance  ^1,  a  Bradley-Terry  likelihood,  and  a  normal  density, 
where  7^  has  mean  7l<+1^  and  variance  ^1.  When  t  —  1,  the  first  term  in  (4.14)  is  the 
normal  prior  density  for  7^,  and  when  t  =  T,  the  last  term  in  (4.14)  vanishes.  Let  7W 
be  the  mode  of  the  likelihood  for  tournament  t  and  let  be  the  information  matrix 
evaluated  at  the  mode.  The  variance  and  mean  of  the  distribution  defined  by  the  product 
of  these  three  densities  is  given  by 

Vt  =  Var(7(,)|7^-t),u7c,  Z?r)  =  (ii(<)  +  2u;cI)_1 

Mt  =  E(7(‘)|7<'t),wc,Z?r)=Vi(it(t)7(<)+a;c(7<t-1)  +  7i<+,))).  (4.15) 
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Again,  when  t  =  1  or  I  =  T  these  expressions  change  to  reflect  the  appropriate  condi¬ 
tional  density.  Letting  0(T^lTe-*\we,  Dt)  be  the  normal  density  of  7^  with  mean  Mt 
and  variance  Vt,  and  letting  w(-)  =  /(-)/g(-),  we  draw  7*  from  the  multivariate  normal 
distribution  g.  The  Metropolis  step  gives  the  current  draw 

Y  w.p.  min(l,^^) 

w.p.  l-min(i,^^) 

This  completes  step  2  for  a  single  t  of  the  Gibbs  sampler. 

Sampling  from  (u>|7(1*,...,7(T\Dr) 


This  section  discusses  step  3  of  the  Gibbs  sampler  outlined  above.  The  conditional 
posterior  distribution  of  u  given  the  rest  of  the  parameters  does  not  require  a  Metropolis 
step.  Suppose  we  have  current  draws  7^,  •  •  •  ,7^,  and  we  would  like  to  draw  uic  from  the 
conditional  distribution  of  The  conditional  posterior  distribution 

of  u  is  proportional  to  the  density  in  (4.13).  Neglecting  terms  that  are  constant  with 
respect  to  u  and  substituting  in  the  exact  densities  yields 


/M7c1)»---i‘rcT)>-Dr) 
cx  u>a°-1e-bou' 


x  n  -  -rf-w’  - 


oc 


X  <*^r_1)/2exp(— ^  ^(7e*)  ~  7c  1))T('Jr[<)  —  7^*  l))). 


1=2 


so  that 

{u\'f{c1),...,'r^r\DT)  ~  Gamma( oo  +  p(T  —  l)/2,6o+  x ^(7^ ~ Trl<-1^)T(7c<) 

i  t=2 

(4.16) 

At  step  3  of  the  Gibbs  sampler,  we  draw  uc  according  to  the  gamma  distribution  in  (4.16). 


4.5.3  Sequential  Imputation  for  Parameter  and  Predictive  Inferences 

The  result  of  performing  the  Gibbs  sampler  is  a  large  sample  of  draws  from  the  joint 
posterior  distribution  of  (7^, . .  .,7^,w| Dj).  We  show  in  this  section  how  to  make 
inferences  on  parameters  of  interest  based  on  these  samples.  We  also  demonstrate  using 
sequential  imputation  a  method  for  forecasting  future  performances  and  for  updating  the 
parameter  distributions  when  new  data  is  observed. 
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Suppose  the  Gibbs  sampler  provides  m  samples  of  parameter  values  which  can  be 
considered  drawn  at  random  from  the  joint  posterior  distribution  of  the  parameters.  The 
collection  of  these  samples  can  be  listed  as 


{(Vi1’,  •  •  -,7lT)^l),(72X) . , 


where  the  subscripts  index  samples  rather  than  players.  We  can  approximate  the  exact 
joint  continuous  posterior  distribution  by  a  discrete  distribution  assigning  posterior  prob¬ 
abilities  of  irfT^  =  1/m  to  each  of  the  m  samples.  In  the  following  discussion,  we  refer  to 
the  posterior  probabilities  as  xp  to  retain  flexibility. 


Obtaining  approximate  inferences  about  parameters  is  trivial  in  this  framework.  To 
estimate  the  marginal  mean  and  variance  of  the  most  recent  player  parameters,  7^),  it 
suffices  to  find  the  expectation  over  Thus  the  mean  and  variance  for  are  given 
by 

ECr^lDr)  «  EW 

i=l 

Var(7(T)|flT)  *  £  *,(T)(t!T)  -  E(7(T)|Dr))(7.(T)  -  E(7(T)|X>t))\ 

t=i 

which  are,  of  course,  just  the  sample  mean  and  sample  variance-covariance  matrix.  Pos¬ 
terior  intervals  for  7<r)  can  be  computed  from  the  empirical  distribution  as  well. 

Inferences  about  (<r2|.Dy)  can  be  performed  analogously.  The  marginal  posterior  dis¬ 
tribution  on  (o2| Dt)  is  approximated  by  the  discrete  distribution  on  m  values,  of  =  1/w,, 
with  probabilities  *P  =  1/m.  Moments  of  the  true  distribution  can  be  approximated  by 
the  moments  of  the  discrete  approximating  distribution  as  before.  Confidence  intervals 
for  (<t2|Dx)  can  be  estimated  from  the  empirical  distribution. 


To  obtain  predictive  distributions  for  the  team  ratings  at  time  T  +  1,  we  make  use 
of  sequential  imputation  (Kong,  Liu  and  Wong  1991).  In  the  context  of  the  dynamic 
Bradley-Terry  model,  we  impute  m  values  for  7,-T+1^.  We  have  from  our  Gibbs  sampler 
posterior  draws  7^,... ,7^  of  (7(T)|I?t)>  and  of  (<r2| Dj)-  We  can  then 

approximate  the  distribution  of  (7^r+1^|Z?r)  by  drawing  m  sample  values  from 


7 


(7+1) 


NfrPVfl), 


thereby  obtaining  m  draws  from  the  marginal  distribution  of  (7*T+1)|Z?r)- 


To  estimate  the  prior  probability  that  player  j  defeats  player  k  at  time  T  + 1,  we  need 
to  find  the  value  of  the 

rfS'kDr)  =  /  pjr+,)(7(Tt,,.i>r)/(-r(T*,»|Dr)i-r<T+,> 

exp(7f>‘1) 


exp(7f+1))  +  exp(7r+10 


(7+1)- 


/(7<T+1>|ZW7(r+I). 


4.5  Iterative  Simulation  Analysis  of  the  Dynamic  Bradley-Terry  Model  70 


From  the  imputed  *  =  1,  — ,  m,  this  integral  can  be  approximated  by 

V'  ^(T) 

,=i  exp(7-f+1))  +  exp(^J+l)) 

where  is  the  j-th  component  of  the  vector  7-T+1*.  Thus  for  each  t,  we  compute  the 

Bradley-Terry  probability,  and  compute  its  expectation  over  the  empirical  distribution 
of  the  m  values. 


Updating  x^l  to  to  incorporate  tournament  data  at  time  T  +  1  requires  the 

application  of  sequential  imputation  in  a  manner  similar  to  that  of  Section  2.8.  From 
Bayes  rule,  we  have 


/(7(1), . .  .,7<T+1).w|i>r+i)  =  /(—!i0J;^|0T)'rl"‘l’“>/('r'1).- 

oc  /(dr+t|Z?T,  T(1) 3»  •  • .  ,T(T+1), w)  /(t(1),  •  •  M  7(T+1), u\Dt) 
-  /(<*r+ il7(T+1))  /( 7(1),  •  •  • ,  7(T+1),  w|/?r). 


This  last  equality  holds  because  the  distribution  of  game  outcomes  at  time  T  4-  1  given 
7<t+1)  is  independent  of  the  other  parameters  and  the  previous  data.  Thus  the  joint 
posterior  distribution  given  Dj+t  is  a  reweighting  of  the  joint  posterior  distribution  by  a 
factor  of  /(dT+i|7*r+1*)>  which  can  be  computed  exactly.  To  obtain  rjT+1\  we  reweight 
by  the  exact  densities,  that  is, 


(T+i)  "•|T)/(^T4il7,(T'l~1)) 

E”J*f)/(dT+i|7<T+1))' 


(4.17) 


/Jj  j\ 

From  this  new  set  of  values  of  x,  '  we  can  compute  posterior  credible  intervals  for 
parameters  of  interest,  such  as  -y(r+i)  or  gi  conditional  on  Dy+i- 


4.5.4  Summary  of  Iterative  Simulation  Analysis 

A  typical  analysis  of  tournament  data  will  therefore  proceed  in  the  following  manner. 

1.  We  observe  T  tournaments  of  game  outcomes,  and  perform  a  Gibbs  sampler  analysis 

to  obtain  m  draws  from  the  joint  posterior  distribution  of  (7W, . .  and 

set  -  1/m  for  t  =  1, . . ., m. 

2.  We  sequentially  impute  7(r+1*  for  each  of  the  m  draws  of  (7^, . .  by 

randomly  sampling  from  N(7|T',  ^-1). 

3.  We  now  observe  tournament  data  at  time  T  +  1,  and  update  the  weights  associated 

with  our  m  samples,  xj7"*,  according  to  (4.17)  to  obtain  rjT+1K 
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An  interesting  aspect  of  this  analysis  is  that  we  are  always  using  the  same  m  draws 
obtained  from  the  Gibbs  sampler  at  time  T.  A  possible  problem  with  this  analysis  is 
that  continued  sequential  updating  may  reveal  that  the  original  sample  of  parameters  do 
not  represent  the  true  distribution  accurately.  For  example,  after  updating  the  posterior 
distributions  to  incorporate  several  new  tournament  results,  we  may  find  that  the  values 
of  u>  obtained  from  the  original  sample  of  the  Gibbs  sampler  do  not  include  a  wide  enough 
range  of  values.  This  may  be  diagnosed  by  a  skewed  distribution  on  any  of  the  parameters 
after  several  sequential  updates,  or  alternatively  by  noting  that  few  Gibbs  samples  receive 
increasing  weight  in  imputation.  In  this  case,  it  may  be  necessary  to  reperform  the  Gibbs 
sampler  in  its  entirety  to  obtain  a  more  representative  sample  of  draws  from  the  posterior 
distribution. 


4.6  Ties 


In  many  situations,  a  paired  comparison  can  result  in  neither  object  being  preferred.  The 
models  considered  to  this  point  assume  that  a  comparison  results  in  a  binary  outcome. 
Here  we  develop  a  model  where  a  third  outcome,  no  preference,  is  possible. 


4.6.1  A  Model  for  Ties 

The  probability  model  we  consider  here  is  proposed  by  Davidson  (1970).  The  model 
postulates 

_ e* _ 

+  e1’  -f 

e~t>  4.  e<3  4. 

eA+(-r.+'r,)/2 

e-».  4-  e1!  4-  eA+(r*+rj)/2’  (4-18) 

where  the  parameter  A  is  an  index  of  discrimination.  Large  positive  values  of  A  indicate 
higher  probabilities  of  a  tie.  Equation  4.18  assumes  that  one  additional  parameter  is  suf¬ 
ficient  for  modeling  the  occurrences  of  ties.  One  generalization  that  has  been  examined 
incorporates  a  separate  parameter  A, j  for  each  pair  being  compared  (Singh  and  Gupta 
1975;  Singh  and  Gupta  1978).  Another  possible  generalization  is  to  allow  one  tie  param¬ 
eter,  A,,  per  player.  This  possibility  is  discussed  in  Section  5.7.  A  further  attempt  at 
extending  the  Bradley- Terry  model  to  incorporate  ties  refers  back  to  the  extreme  value 
score  distribution  as  a  motivation  for  the  Bradley- Terry  model.  Rao  and  Kupper  (1967) 
proposed  a  model  that  postulates  a  tie  is  declared  when  the  difference  between  two  scores 
is  less  than  a  certain  threshold  parameter. 


p,j  =  Pr(t  defeats  j)  = 
Pji  =  Pr(j  defeats  i)  = 
Pij. o  =  Pr(:  ties  j)  = 


The  Bayesian  analysis  of  the  Davidson  model  proceeds  in  a  manner  analogous  to 
Section  4.2.  We  impose  a  multivariate  normal  prior  distribution  on  the  parameters 
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The  analysis  approximates  the  distribution  of  the  parameters  by  a  mul¬ 
tivariate  normal  distribution  centered  at  the  maximum  likelihood  estimates.  The  approx¬ 
imate  multivariate  normal  posterior  distribution  is  obtained  by  the  usual  averaging  of  the 
normal  prior  distribution  with  the  approximate  normal  likelihood.  Again,  this  procedure 
obtains  the  exact  posterior  distribution  only  if  the  likelihood  is,  in  fact,  approximately 
normal. 


We  propose  a  dynamic  generalization  to  the  Davidson  model.  Let  7W  =  (7}^, ....  7^ ) 
be  the  vector  of  player  ratings  at  time  t,  and  let  the  tie  parameter.  A,  be  constant  over 
time.  Then  the  probability  model  for  observations  at  time  t  is  given  by 


p\y  =  Pr(t  defeats  j) 
=  Pr (j  defeats  t) 
P.%  =  Pr(*  ties  J) 


e*. 


<«) 


tP  +  eP  + 

o(,) 
e  o 


e^  +  e^  +  e^'^r)/2 
cA+h|*)-pyJ,))/a 

^,,  +  e^  +  e^,W))/»' 


(4.19) 


The  probability  model  for  the  evolution  of  the  parameters  7W  is  given  by 

T(‘)  _  7(‘-t)  +  „(<) 


and  as  in  Section  4.3  we  let 

(i/<‘)|<72)~N(at,<72Ip), 

where  at  is  the  mean  amount  by  which  team  abilities  change  between  tournaments  t  —  1 
and  i,  and  <x2  is  the  variance  that  reflects  the  increased  uncertainty  in  7M  with  the  passage 
of  time. 


We  place  a  prior  distribution  on  (V1),  A), 

(7(,\A)~N(m(1),C<1>), 

where  and  are  ( p  +  l)-dimensional.  The  prior  distribution  onw=  1/<t2  is 

f(tjj)  «  aiO0_1e~6°u'. 

The  hyperparameters  ao,  and  60  are  assumed  to  be  specified  in  advance. 

It  is  worth  noting  the  similarities  and  differences  between  the  dynamic  Bradley-Terry 
models  with  and  without  ties.  The  model  for  the  evolution  of  parameters  treats  the  7^’ 
as  time  varying  in  both  models,  and  the  amount  of  change  over  time  is  governed  by  the 
variance  parameter  a2.  The  model  also  assumes  that  the  prior  distribution  of  7'1)  is 
multivariate  normal,  and  both  models  assume  an  identical  prior  distribution  for  a2.  A 
noteworthy  difference  between  the  models,  besides  the  allowance  for  a  third  outcome  of 
the  paired  comparison,  is  that  the  parameter  A  is  treated  as  constant  over  time.  In  the 
next  section,  we  discuss  the  implications  of  this  difference  on  the  analysis  of  the  model. 


i 
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4.6.2  Analysis  of  the  Model 

The  analysis  of  the  dynamic  paired  comparison  model  with  ties  requires  only  minor 
changes  to  the  analyses  described  in  Sections  4.4  and  4.5.  We  describe  the  required 
modifications  to  the  non-iterative  and  Gibbs  sampler  analyses  here. 

As  in  the  paired  comparison  models  with  binary  outcomes,  the  non-iterative  analysis  of 
the  model  with  ties  involves  approximating  the  likelihood  at  each  time  t  by  a  multivariate 
normal  density.  For  the  Davidson  model,  Bradley  and  Gart  (1962)  show  how  to  obtain  the 
maximum  likelihood  estimates  and  the  estimated  information  matrix  for  each  tournament. 
The  likelihood  can  be  approximated  by  a  multivariate  normal  density  by  setting  the  mean 
to  be  the  vector  of  maximum  likelihood  estimates,  and  the  inverse  of  the  variance  to  be 
the  estimated  information  matrix.  We  let  be  the  (p  +  1)- vector  A),  and  let  7^ 
be  the  maximum  likelihood  estimate  of  7!**  from  the  data  at  time  t  and  let  ftl'*  be  the 
(p  +  1  )-dimensional  information  matrix. 

The  analysis  of  the  model  conditional  on  a3  =  1  /u  requires  the  respecification  of  the 
updating  and  forecasting  calculations.  Suppose  before  observing  tournament  at  time  2, 
the  prior  distribution  on  the  parameters  is 

(7iV,A-l)~N(M<‘>,C(‘>). 

Upon  observing  data  at  time  t,  dt,  we  can  update  the  distribution  above  to  obtain  the 
posterior  distribution 

where 

p,(‘)  =  (C<‘>  -1  +  i2i,))-1(C(‘>  -V°  + 

c'(')  =  (c^-1  +  sit))~1- 

The  forecasting  calculation,  which  computes  the  distribution  of  the  parameters  reflecting 
the  uncertainty  due  to  the  passage  of  time,  is  given  by 

(7i‘+1V,.Dt)  -  N(m<‘+1>,C<‘+1>), 

where 

=  M'(0 

C(‘+D  =  C'W  +  ^Jp.j. 

Here,  JP(i  is  the  (p  +  l)-dimensional  matrix  consisting  of  1  in  the  first  p  diagonal  ele¬ 
ments,  and  0  elsewhere.  The  forecasting  formula  for  the  variance-covariance  matrix  has 
the  interpretation  that  the  variance  increases  over  time  for  7^,  but  the  variance  of  the 
non-dynamic  parameter,  A,  does  not  change  due  to  the  passage  of  time.  These  forecast¬ 
ing  formulas,  along  with  the  updating  formulas,  can  be  used  successively  to  obtain  the 
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marginal  distribution  of  V.7"*  from  the  prior  distribution  on  the  parameters  conditional 
on  a1 . 

Obtaining  the  distribution  of  (u\Dr)  is  identical  to  the  analysis  described  in  Sec¬ 
tion  4.4.2.  We  assume,  as  before,  a  discrete  prior  distribution  on  (w|Z?q).  To  obtain 
the  posterior  distribution  on  (w| D?),  we  compute  the  reweir-  :ing  factors  as  products  of 
marginal  likelihoods  which  can  be  approximated  using  Lapia.ce’6  method.  Once  the  ap¬ 
proximate  distribution  of  (wj2?x)  is  found,  posterior  moments  and  confidence  intervals  can 
be  found  using  the  approximating  discrete  distribution.  Prediction  probabilities  can  be 
found  via  Monte  Carlo  simulation  by  generating  samples  from  the  approximately  normal 

/T  i  i\ 

(T.  | Dt),  and  then  computing  the  preference  probabilities  for  each  of  the  generated 

samples. 

The  Gibbs  sampler  methodology  for  the  model  with  ties  is  similar  to  the  analysis 
considered  in  Section  4.5.  The  main  difference  is  now  that  at  each  iteration  of  the  Gibbs 
sampler,  draws  from  T  +  2  conditional  distributions  are  required;  drawing  from  the  distri¬ 
bution  of  (wh't1), . .  .,-jriTl,  A  ,Dt),  from  each  of  the  X,u,Dt)  for  t  =  1, . .  ,,T, 

and  from  (Afr^i . . .  ,u/,  Dt)-  We  briefly  describe  here  the  sampling  strategy  for  this 

Gibbs  sampler. 

Drawing  from  the  distribution  of  (V^M***.  A ,u,Dt)  is  similar  to  Section  4.5.2  with 
the  Bradley- Terry  likelihoods  replaced  by  the  Davidson  likelihoods.  In  particular,  a  mul¬ 
tivariate  normal  distribution  is  constructed  from  which  a  candidate  vector  for  will 
be  sampled.  The  parameters  of  this  multivariate  normal  distribution  are  obtained  in  a 
manner  identical  to  the  procedure  of  Section  4.5.2;  a  single  draw  is  sampled,  and  the 
ratio  of  importance  weights  is  computed  by  calculating  the  densities  of  both  the  normal 
distribution  from  which  the  draw  was  sampled,  and  the  density  corresponding  to  the 
exact  conditional  posterior  distribution  (the  target  distribution).  The  Metropolis  step 
then  either  accept  the  candidate  vector  tt'*)  or  retains  the  previous  value.  In  this  entire 
computation,  A  is  treated  as  fixed  and  known. 

To  draw  from  the  distribution  of  (A|V*\ _ ,  w,  Dt)-,  we  also  make  use  of  a 

Metropolis  step.  First,  before  performing  the  Gibbs  sampler,  we  construct  a  normal 
distribution  from  which  to  sample  a  candidate  value  of  A.  To  do  this,  we  make  a  rough 
guess  at  the  mean  and  variance  of  the  marginal  posterior  distribution  of  A  by  averaging  the 
posterior  means  and  the  posterior  variances  obtained  by  considering  the  T  tournaments 
individually.  We  then  inflate  this  variance  by  a  positive  value  greater  than  1  to  ensure 
that  the  candidate  samples  are  dispersed  relative  to  the  true  posterior  distribution  of  A. 
This  same  normal  approximation  is  used  to  generate  candidate  values  of  A  throughout  the 
Gibbs  sampler.  Sampling  A  conditional  on  the  remaining  parameters  involves  drawing  a 
candidate  value  of  A  from  the  overdispersed  normal  distribution,  and  then  invoking  the 
Metropolis  step  by  computing  the  ratio  of  importance  weights  and  accepting  the  candidate 
value  with  the  appropriate  probability. 
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To  sample  from  (u>|7(1), . .  Dt),  the  conditional  distribution  is  given  by 

,  T-i 

. .  .,7(r),  A,  Dt)  ~  Gamma(ao+p(r-l)/2,6o+-  ^2('r(,+1)-'r(*))T(7,t+1)-'y(t))) 

1  !=1 

Note  that  the  distribution  of  u  does  not  involve  A,  as  the  tie  parameter  is  assumed  to 
remain  constant  over  time.  This  completes  a  single  draw  from  the  Gibbs  sampler.  The 
algorithm  continues  until  the  Gibbs  sampler  converges,  after  which  a  random  sample 
from  the  marginal  posterior  distribution  of  all  parameters  may  be  drawn,  and  techniques 
presented  in  Section  4.5.3  may  then  be  employed  to  find  marginal  inferences  and  predictive 
probabilities.  The  approach  described  here  can  be  used  to  include  more  than  one  non¬ 
dynamic  covariate.  In  fact,  in  Chapter  5  we  allow  for  an  order  effect  (p^  depend  on  the 
order  in  which  i  and  j  are  compared)  in  addition  to  allowing  for  ties. 


4.7  Inclusion  of  Covariates 


In  both  the  Bradley- Terry  model  and  the  Davidson  model,  extensions  can  be  made  to 
incorporate  covariate  information.  We  assume  covariates  can  be  modeled  as  i  linear 
combination  of  parameters,  and  that  these  parameters  are  non-dynamic.  Generalizing  to 
dynamic  parameters  is  also  straightforward,  although  we  do  not  consider  that  case  here. 

Suppose  the  q-dimensional  vector,  /?,  contains  the  parameters  for  the  q  covariates. 
Let  Xk  be  a  data  vector  corresponding  to  the  fc-th  individual  paired  comparison,  and 
suppose  the  fc-th  comparison  involves  players  **  and  j*.  The  vector  Xk  is  the  set  of 
covariates  associated  with  the  fc-th  game.  We  then  consider  the  following  generalizations 
of  the  Bradley-Terry  and  Davidson  models  for  paired  comparisons.  For  the  Bradley- Terry 
model,  we  let 

exp(7  «t  +  xk0) 

P,klk  exp(7  u  +  **/?)  +  exp(7>J 

expCTu  ~  In  +  ffcft) 
exp (%,  -  7j*  +  *kP)  +  1 ' 

For  the  Davidson  model,  we  postulate 

_ exp(7u  +  XkP) _ 

exp(7<*  +  xk0)  +  exp(7>t)  +  exp(A  +  (j,k  +  7>J/2) 

_ _ exp(7Jt) _ 

exp(7<k  +  XkP)  +  exp(7>lk)  +  exp(A  +  (7.„  +  7j*)/2) 

_ exp(A  -I-  {nk  -f  7>t)/2) _ 

exp(7<*  +  xk/})  +  exp(7>t)  +  exp(A  +  (7 <(k  +  7,.  )/2)' 

These  models  can  be  fit  by  performing  maximum  likelihood  estimation  in  the  usual  way. 
Critchlow  and  Fligner  (1991)  demonstrate  that  these  paired  comparison  models  can  be 


Pikik  = 

Pjk'k  — 
Pikik- 0  = 
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reparametrized  as  generalized  linear  models;  they  take  advantage  of  this  structure  in  order 
to  analyze  the  models  using  standard  statistical  packages.  Estimates  of  7  and  Q  along 
with  the  singular  covariance  matrix  can  be  obtained  by  recognizing  and  implementing  the 
paired  comparison  model  as  a  generalized  linear  model. 

The  Bayesian  extension  to  these  models  is  analogous  to  the  Bayesian  extension  of  the 
ordinary  Bradley-Terry  model  in  Section  4.2.  A  multivariate  normal  prior  distribution  is 
placed  on  all  the  parameters,  and  the  likelihood  is  approximated  by  a  normal  density  in 
order  to  carry  out  an  analysis  that  is  approximately  conjugate.  We  can  then  report  the 
posterior  distribution  of  parameters  as  approximately  normal. 

The  Bayesian  model  can  be  further  extended  to  a  dynamic  model  in  analogy  with 
Section  4.3.  It  is  assumed  here  that  the  0  remain  constant  over  time.  The  analysis  of 
such  a  model  is  identical  to  the  analysis  of  the  dynamic  Davidson  paired  comparison 
model  of  Section  4.6.2.  Once  again  the  parameters  can  be  partitioned  into  those  that  are 
dynamic  (the  7^)  and  those  that  are  non-dynamic  (A,/3).  Both  the  iterative  and  non¬ 
iterative  analyses  of  Section  4.6.2  can  be  extended  trivially  by  allowing  for  the  greater 
number  of  non-dynamic  parameters. 


Chapter  5 


Analysis  of  Chess  Game 
Outcomes 

In  this  chapter,  we  apply  the  methodology  developed  in  Chapter  4  to  chess  tournament 
outcomes  collected  over  a  period  of  time.  The  model  developed  here  is  an  extension  of 
the  Bradley- Terry  model  with  ties  that  incorporates  an  order  effect.  The  model  also 
incorporates  a  dynamic  component  that  specifies  the  evolution  of  chess  players’  abilities 
over  time.  We  describe  the  World  Cup  chess  tournament  data  set  that  is  used  in  our 
analysis  in  Section  5.1.  In  Section  5.2,  the  model  for  chess  game  outcomes  is  described, 
and  in  Section  5.3  we  describe  the  implementation  of  the  model  for  the  World  Cup  dcta. 
We  present  the  results  of  the  analysis  and  posterior  inferences  for  the  rating  parameters 
in  Section  5.4,  and  also  show  how  the  model  can  be  used  for  predictive  inferences.  This  is 
followed  in  Section  5.5  by  an  examination  of  model  diagnostics  using  posterior  predictive 
checks  (Rubin  1984).  We  report  the  Gibbs  sampler  analysis  of  simulated  data  with  a 
greater  number  of  games  between  players  in  Section  5.6.  Section  5.7  presents  possible 
computational  improvements,  and  suggests  modifications  of  the  model  for  chess  game 
outcomes. 


5.1  World  Cup  Chess 


To  demonstrate  the  methodology  and  techniques  developed  in  Chapter  4,  we  consider  a 
model  for  chess  performance  applied  to  the  chess  World  Cup  of  1988-1989.  The  World 
Cup  consisted  of  six  tournaments  and  was  comprised  of  29  of  the  world’s  top  chess  players. 
Information  on  the  World  Cup  tournaments  was  obtained  from  Kavalek  (1990).  Table  5.1 
lists  the  dates  and  sites  of  each  of  the  six  tournaments,  and  the  number  of  competitors 
who  participated  in  each  tournament.  Of  the  29  players,  22  players  competed  in  4  of  the  6 
tournaments,  3  players  competed  in  3  of  the  tournaments,  and  4  players  only  competed  in 
1  tournament.  The  World  Cup  participants  who  competed  in  3  or  more  tournaments  were 
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Tournament 

Dates 

Numoer  of  Competitors 

Brussels,  Belgium 

H  April  1,  1988  -  April  22,  1988 

18 

Belfort,  France 

June  14,  1988  -  July  3,  1988 

16 

Reykjavik,  Iceland 

October  3,  1988  -  October  24,  1988 

18 

Barcelona,  Spain 

March  30,  1989  -  April  20,  1989 

17 

Rotterdam,  Netherlands 

June  3,  1989  -  June  24,  1989 

16 

Skelleftea,  Sweden 

August  12,  1989  -  September  3,  1989 

16 

Table  5.1:  World  Cup  Chess  Tournaments,  1988-1989 


contenders  for  monetary  prizes.  Table  5.2  lists  the  players  and  indicates  the  tournaments 
in  which  each  player  competed. 

For  each  game  in  the  World  Cup,  the  data  consists  of  the  players  involved  in  the  game, 
the  outcome  of  the  game  (win,  loss  or  draw),  an  indication  of  which  player  played  the 
white  pieces  (the  player  with  the  white  pieces  moves  first),  and  the  tournament  in  which 
the  game  occurred.  Each  tournament  is  a  single  round-robin  (i.e.,  each  player  plays  every 
other  player  exactly  once),  except  the  first  tournament  in  which  one  player  withdrew  after 
playing  four  games.  The  data  consists  of  a  total  of  789  games  spanning  17  months. 


5.2  A  Model  for  Chess  Game  Outcomes 


The  probability  model  that  we  assume  for  chess  game  outcomes  is  an  extension  of  the 
Bradley- Terry  model.  The  extension,  due  to  Davidson  and  Beaver  (1977),  includes  a  tie 
as  a  possible  outcome  of  a  comparison,  and  also  incorporates  a  parameter  for  a  within-pair 
order  effect.  We  would  like  to  model  an  order  effect  because  the  player  with  the  first  move 
in  a  chess  game  is  commonly  believed  to  have  an  advantage.  The  probabilities  of  the  three 
game  outcomes  for  tournament  t,  given  that  player  i  moves  first,  axe  specified  as 

=  Pr(i  defeats  ;) 

P*‘(.)  =  Pr (i  defeats  *) 

p!‘(,).o  =  Pr(*  ties  j) 

where  the  parenthesized  subscript  (i)  indicates  that  player  i  moved  first. 

The  model  in  (5.1)  extends  the  Davidson  model  of  Section  4.6.1  by  including  the  order 
effect  parameter,  tj.  If  rj  takes  on  a  positive  value,  then  the  probability  of  winning  for 
the  the  player  with  the  first  move  is  lower  than  for  his  opponent,  whereas  when  tj  takes 


e'r« 


,<»> 


e**  +  eW*  +  eA+<-W>/* 


eA+(-r<,)+'r),))/2 


(5.1) 
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T9 

Player 


Andersson 


Belyavsky 


Ehlvest 


Hjartarson 


Hubner 


Illescas 


Karpov 


Kasparov 


Korchnoi 


Ljubojevic 


Nikolic 


Noguieras 


Nunn 


Petursson 


Portisch 


Ribli 


Salov 


Sax 


Seirawan 


Short 


Sokolov 


Spassky 


Speelman 


Tal 


Timman 


Vaganian 


Vanderwiel 


Winants 


Yusupov 


Brussels  I  Belfort  I  Reykjavik  |  Barcelona  |  Rotterdam  |  Skelleftea 


Table  5.2:  World  Cup  Chess  Participants,  1988-1989 
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on  a  negative  value,  the  player  with  the  first  move  has  a  higher  probability  of  winning. 
In  chess,  we  would  anticipate  that  r?  is  negative.  When  rj  is  0,  the  model  indicates  no 
advantage  or  disadvantage  to  moving  first,  and  the  model  reduces  to  the  Davidson  model. 


We  assume  the  evolution  of  the  player  strength  parameters  is  governed  by 

7(<)  =  7(‘-i)  + 


with 

v(l)~N(0,i(Vl),  (5.2) 

where  and  -pi*-1)  are  the  vectors  of  rating  parameters  for  tournaments  t  and  t  —  1, 
respectively.  Because  tournaments  are  not  separated  by  equal  amounts  of  time,  we  define 
<r2  to  be  the  variance  per  month  and  let  be  the  number  of  months  between  tournaments 
t  and  t  —  1.  We  assume  a  prior  distribution  on  the  parameters 

(7(1U,»?)~N(m(1),C<1>). 

As  before,  */(*)  is  the  change  incurred  in  the  player  parameters  from  tournament  to  tour¬ 
nament.  The  factor  in  (5.2)  is  justified  by  treating  the  amount  of  variability  from 
month  to  month  as  independent,  so  that  JtM  months  passing  between  tournaments  corre¬ 
sponds  to  summing  independent  random  variables  with  mean  0  and  variance  a2 .  For 
our  analysis,  the  fcW  are  known  and  observed  quantities,  not  necessarily  integers.  Notice 
that  the  variation  per  month  in  player  abilities  is  assumed  constant  over  time,  and  the 
same  for  all  players.  We  assume  that  the  draw  parameter  A  and  the  color  parameter  rj 
remain  constant  over  time. 


We  choose  a  vague  prior  distribution  for  the  parameters  of  our  model  to  reflect  our 
initial  uncertainty.  The  prior  parameters  are 

=  (0,..., 0,1, -.25) 

C(1)  =  10  - 1. 

These  prior  parameters  assign  the  same  mean  to  each  player’s  rating  parameter,  but  assign 
a  large  variance  as  an  indication  of  our  uncertainty.  The  prior  mean  of  1.0  for  A  and  the 
prior  mean  of  -  .25  for  t?  imply  that  players  of  equal  ability  will  draw  games  slightly  over 
60%  of  the  time  and  that  the  player  who  moves  first  will  win  56%  of  the  games  that  do 
not  result  in  draws. 

The  prior  distribution  on  u  =  l/o2,  the  precision  per  month,  is  also  vague,  with 
improper  density 

/(w)  =  u>~3/3 

which  corresponds  to  a  uniform  prior  distribution  on  o.  The  values  of  k M,  t  =  2,..., 6, 
for  the  World  Cup  data  are  1.7,  3.0,  5.2,  1.4,  and  1.6.  These  are  the  number  of  months 
between  successive  tournaments,  and  are  treated  as  fixed  in  the  analysis. 


Our  model  for  chess  game  outcomes  has  strong  connections  with  the  Elo  chess  rating 
system  (Elo  1978)  that  is  used,  in  two  different  forms,  by  the  U.S.  Chess  Federation  and 
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the  World  Chess  Federation,  as  well  as  other  national  chess  federations.  The  Elo  system 
assumes  that  the  expected  score  of  a  game  played  between  players  t  and  j,  where  the 
score  is  a  random  variable  taking  the  value  1  for  a  win,  0  for  a  loss,  and  ^  for  a  draw,  is 
the  Bradley- Terry  probability  e7‘ /(e7‘  +  eyj).  The  “Elo  rating”  of  a  player,  R„  is  related 
to  the  rating  parameter  7,  of  the  Bradley-Terry  model  as  R,  =  The  model  in 

(5.1)  improves  on  some  of  the  drawbacks  of  Elo’s  system  in  that  we  directly  model  the 
probability  of  a  draw  as  a  third  outcome,  and  we  include  a  parameter  for  an  order  effect. 
Also,  although  Elo’s  model  recognizes  that  players’  abilities  may  change  over  time,  the 
updating  algorithm  does  not  distinguish  between  varying  amounts  of  player  inactivity. 

Many  variations  of  paired  comparison  models  have  been  proposed  for  rating  chess 
players.  Batchelder  and  Bershad  (1979)  propose  an  extension  of  the  Thurstone-Mosteller 
model  that  incorporates  ties  and  updates  players’  abilities  by  taking  weighted  averages  of 
past  performances.  Joe  (1990)  examines  the  world’s  best  chess  players  of  all  time  by  with 
a  model  that  splits  players’  careers  into  “peak”  periods  and  “off-peak”  periods.  Henery 
(1992)  analyzes  the  same  data  set  as  Joe  using  an  extension  of  the  Thurstone-Mosteller 
model  that  allows  for  draws,  and  proposes  using  the  length  of  a  game  as  a  covariate  for  the 
analysis  of  a  larger  chess  game  data  set.  In  a  more  developmental  approach,  Joe  (1991) 
derives  axiomatically  a  general  framework  for  a  rating  system,  and  shows  how  the  Elo 
system  is  a  special  case. 


5.3  Model  Implementation 


We  describe  in  this  section  the  implementation  of  the  non-iterative  analysis  and  the  Gibbs 
sampler  analysis  in  order  to  obtain  posterior  inferences  of  parameters  for  the  model  of 
Section  5.2.  We  indicate  below  how  different  time  intervals  affect  the  analyses. 


5.3.1  Gibbs  Sampler  Implementation 

The  implementation  of  the  Gibbs  sampler  here  parallels  that  of  Section  3.1.2.  We  ran  three 
parallel  Gibbs  samplers  with  overdispersed  starting  values  of  or:  1/52,  l/.l2,  and  1/.0012. 
We  believe  that  a  =  5  and  a  =  .001  are  towards  the  extremes  of  the  posterior  distribution 
of  cr,  so  that  our  choice  of  starting  values  is  overdispersed  for  the  target  distribution. 
The  starting  values  for  7W  were  obtained  by  first  setting  7W  to  0,  and  then  adding 
Gaussian  noise  with  standard  deviation  VkWff  to  7W  to  obtain  7*,+1*.  The  initial  values 
in  the  Gibbs  sampler  for  the  77  and  A  for  all  t  were  0.  Each  Gibbs  sampler  proceeds  by 
drawing  in  sequence  from  each  of  the  conditional  distributions  of  7W,  given  the  remaining 
parameters,  followed  by  a  draw  from  their  conditional  distribution  of  (A,  77),  followed  by  a 
draw  from  the  conditional  distribution  of  u.  Because  the  time  between  tournaments  vary, 
drawing  from  the  conditional  posterior  distribution  of  7W  involves  the  inter-tournament 
intervals  k^  via  their  contribution  to  the  between-tournament  variances.  This  does  not 
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cause  any  difficulty  in  the  Gibbs  sampler  analysis,  as  the  term  w  is  replaced  by  u/k^ 

in 

the  conditional  densities.  The  conditional  posterior  distribution  of  u  i6 


T 

M7<1),...,7f\^r)~Gamma(oo+p(T-l)/2,M-^^('r<‘)--r<‘-l))T(Vc,)-7(c'-1))) 


where  7C  refers  to  the  current  draw  of  the  player  rating  parameters  in  the  Gibbs  sampler. 
This  gives  the  appropriate  distrit  ion  of  the  precision  parameter  when  tournaments  are 
separated  by  known  but  varying  numbers  of  months. 


Convergence  of  the  three  series  was  assessed  by  computing  the  potential  scale  reduction 
as  in  Gelman  and  Rubin  (1992a).  After  500  iterations  of  the  Gibbs  sampler  for  each 
series,  we  obtained  a  potential  scale  reduction  of  195.33  computed  on  log,0u.  This  value 
provides  strong  evidence  that  the  Gibbs  sampler  has  not  yet  sampled  from  the  target 
distribution.  Figure  5.1  shows  the  values  of  log10u>  at  each  iteration  in  the  three  series, 
and  clearly  indicates  that  many  more  iterations  of  the  Gibbs  sampler  are  necessary  before 
convergence. 


Several  possible  reasons  exist  for  the  slow,  or  perhaps  lack  of,  convergence  of  the  Gibbs 
sampler  for  the  World  Cup  Chess  data.  First,  the  Gibbs  sampler  for  the  World  Cup  data 
alternates  among  7  +  2  =  8  conditional  distributions,  whereas  the  Gibbs  sampler  for 
the  NFL  data  alternates  between  only  two  conditional  distributions.  It  is  suggested  in 
Liu  (1992)  that  a  Gibbs  sampler  that  collapses  over  parameters  (i.e.,  draws  from  the 
conditional  distribution  of  several  parameters  are  made  simultaneously  rather  than  in 
sequence)  will  converge  more  quickly  than  one  that  does  not.  Additionally,  if  the  joint 
posterior  distribution  has  parameters  that  are  highly  correlated,  as  is  probably  the  case 
with  the  World  Cup  data  due  to  the  time-dependence  of  the  the  Gibbs  sampler 
may  take  many  iterations  to  move  from  arbitrary  regions  of  the  parameter  space  to  the 
high  likelihood  regions  under  the  target  distribution.  Another  explanation  for  the  slow 
convergence  concerns  the  implementation  of  the  Metropolis  step  within  the  Gibbs  sampler. 
In  the  Gibbs  sampler  for  the  NFL  data,  every  trial  draw  is  accepted  because  we  sample 
directly  from  the  exact  conditional  distributions.  The  Gibbs  sampler  with  the  Metropolis 
step  for  the  World  Cup  data  results  in  acceptable  draws  60.7%  of  the  time.  This  suggests 
that  without  accounting  for  the  dependence  among  the  parameters,  the  World  Cup  chess 
analysis  would  be  expected  to  require  about  1/.607  =  1.65  times  as  many  iterations  as  the 
NFL  analysis  in  order  to  produce  the  same  movement  about  the  parameter  space.  A  third 
possible  reason  for  the  slow  convergence  is  that  the  information  contained  in  the  data  is 
not  strong  enough  to  cause  the  Gibbs  sampler  to  move  quickly  to  regions  of  the  parameter 
space  of  high  likelihood.  Whereas  the  NFL  game  score  data  give  a  better  indication  of 
differences  in  team  abilities,  the  World  Cup  data  only  indicate  a  player  who  wins  a  game 
(or  whether  a  game  ends  in  a  draw). 


Because  the  Gibbs  sampler  does  not  appear  to  converge  within  500  iterations,  we  do 
not  base  our  analyses  of  the  World  Cup  data  on  the  results  of  the  Gibbs  sampler.  We 
illustrate  the  Gibbs  sampler  in  Section  5.6  with  simulated  tournament  data  consisting  of 
10  games  per  player  pair  within  each  tournament. 


5.3  Model  Implementation 


83 


5? 


* '  V> 


A*' 


.i  y*\  »A. 


■  .  <  ;V 

tff'v?  ■' 


~li.  A 
\  •  ‘ 


— I — 
100 


T 


200 
Iteration 


300 


— I — 
400 


— r~ 

500 


Figure  5.1:  Three  parallel  Gibbs  sampler  series  of  log10ui 
5.3.2  Non-iterative  Analysis 

The  non-iterative  analysis  of  Section  4.4  can  also  be  applied  to  the  World  Cup  model.  The 
distributions  of  parameters  at  time  t  are  approximated  by  multivariate  normal  distribu¬ 
tions  centered  at  the  maximum  likelihood  estimates.  The  maximum  likelihood  estimates 
are  obtained  by  fitting  an  appropriate  GLIM  model  as  described  in  Critchlow  and  Fligner 
(1991).  Approximate  variance-covariance  matrices  are  obtained  according  to  Palmgren 
(1981),  who  describes  methodology  to  obtain  asymptotic  variances  for  multinomial  re¬ 
gression  models.  The  prior  distribution  of  a  is  approximated  according  to  /(<t)  =  1/51 
for  a  =  1/vC  €  (0,  This  set  of  values  if  chosen  to  span  the  posterior 

distribution  of  a.  The  inclusion  of  the  time  intervals,  k^\  requires  only  minor  modifi¬ 
cations  of  the  non-iterative  analysis.  Forecasting  between  tournaments  t  and  t  +  1  now 
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becomes 

C<‘+1>  =  C'<‘>  +  Jp,2, 

where  JPi2  is  the  identity  matrix  with  the  last  two  diagonal  elements  (corresponding  to 
the  variance  of  the  change  in  the  Jraw  and  order  parameters)  set  to  0.  The  reweighting 
factors,  are  computed  by  the  methods  of  Section  4.4.2,  after  noting  that 

the  forecasting  equations  involves  the  addition  of  Jb^<r2Jp>2  rather  than  <72JP,2. 


5.4  Results  of  World  Cup  Analysis 


In  this  section,  we  show  the  results  of  the  non-iterative  analysis  of  the  World  Cup  chess 
data.  We  examine  the  posterior  distribution  of  the  system  standard  deviation,  <7,  and 
obtain  the  approximate  marginal  posterior  distributions  of  7^,  A,  and  rj.  The  players 
might  be  ranked  according  to  the  marginal  posterior  mean  vector, 


5.4.1  Marginal  Posterior  Distribution  of  a 

Figure  5.2  displays  the  approximate  posterior  distribution  of  a,  the  system  standard 
deviation,  from  the  non-iterative  analysis.  The  figure  shows  a  right-skewed  distribution 
with  values  of  a  near  0  having  most  support.  With  the  uniform  prior,  this  indicates  that 
the  data  do  not  give  strong  evidence  that  players’  abilities  change  substantially  for  the 
period  of  the  World  Cup  tournaments.  This  result  makes  sense,  in  light  of  the  long  careers 
of  the  World  Cup  competitors,  17  months  may  not  be  much  time  for  players’  abilities  to 
improve  or  decline.  The  approximate  posterior  mean  anc  standard  deviation  of  a  are 
0.115  and  0.0812,  respectively.  Using  the  posterior  mean  of  a  as  a  summary,  players’ 
abilities  change  from  month  to  month  by  an  amount  with  a  standard  deviation  of  roughly 
0.115,  which  corresponds  to  a  20  point  rating  change  on  the  Elo  scale. 


5.4.2  Marginal  Posterior  Distribution  of  7<6> 

For  the  non-iterative  analysis,  we  compute  the  posterior  distribution  of  (V6)|Z?6,o')  for 
each  <7  €  {0*  ife?  t§5’  •  •  •»  From  the  previous  section,  we  have  the  approximate 

posterior  distribution  on  (trj^e),  so  we  can  compute  the  approximate  posterior  means  and 
standard  deviations  of  7^  marginalized  over  a.  Table  5.3  displays  these  values,  along  with 
approximate  95%  credible  intervals  for  7^.  The  95%  credible  intervals  were  computed 
by  generating  3000  values  from  the  marginal  distribution  of  7W  and  computing  the  2.5 
and  97.5  percentiles  empirically.  The  players  are  ranked  in  decreasing  order  according  to 
the  posterior  mean  of  7I6). 

The  non-iterative  analysis  indicates  that  the  posterior  means  range  from  —3.50  to 
1.94,  with  the  current  world  champion  having  the  highest  posterior  mean.  The  standard 
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Parameter 

Mean 

Std  Dev 

95%  Credible  Interval 

Kasparov 

1.94 

0.78 

(  0.37, 

3.52) 

Karpov 

1.82 

0.78 

(  0.24, 

3.37) 

Salov 

0.94 

0.76 

(-0.56, 

2.43) 

Timm  an 

0.62 

0.84 

(-0.96, 

2.38) 

Short 

■Ml 

0.76 

(-0.91, 

2.09) 

Nunn 

(-0.95, 

2.05) 

Vanderwiel 

0.53 

1.03 

(-1.46, 

2.50) 

Ljubojevic 

H 

0.78 

(-1.05, 

2.04) 

Ehlvest 

(-1.07, 

2.01) 

Hubner 

KEE1 

0.81 

(-1.19, 

2.00) 

Belyavsky 

III9EZI9 

(-1.27, 

1.90) 

Tal 

0.82 

(-1.33, 

1.86) 

Sokolov 

0.23 

0.80 

(-1.34, 

1.81) 

Andersson 

0.20 

0.79 

(-1.29, 

1.80) 

Portisch 

0.00 

0.76 

(-1.54, 

1.49) 

Seirawan 

-0.06 

wmm 

(-1.51, 

1.45) 

Vaganian 

-0.07 

mm*. a 

(-1.61, 

1.43) 

Ribli 

-0.10 

j—  **1 

(-1.61, 

1.42) 

Sax 

-0.11 

(-1.62, 

1.39) 

Speelman 

-0.15 

(-1.83, 

1.43) 

Spassky 

-0.25 

0.83 

(-1.93, 

1.33) 

Nikolic 

-0.25 

0.77 

(-1.74, 

1.24) 

Yusupov 

-0.28 

(-1.82, 

1.21) 

Korchnoi 

-0.30 

0.77 

(-1.79, 

1.18) 

Hjartarson 

-0.77 

0.78 

(-2.29, 

0.81) 

Noguieras 

-0.79 

HKSi] 

(-2.30, 

0.73) 

Petursson 

-1.32 

1.08 

(-3.43, 

0.82) 

Illescas 

-1.47 

1.04 

(-3.54, 

0.60) 

Winants 

-3.50 

1.26 

(-6.03, 

-1.06) 

Draw 

0.95 

IHiSEI 

(  0.76, 

1.14) 

Color 

-0.48 

HK2£i 

(-0.73, 

-0.23) 

Table  5.3:  Posterior  Tistribution  of  7^ 
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Figure  5.2:  Distribution  of  System  Standard  Deviation 

deviations  of  the  components  of  7^*  indicate  large  variability  of  the  posterior  distribu¬ 
tions.  The  variation  in  standard  deviations  may  be  explained  by  the  different  number  of 
games  in  which  players  were  involved.  For  example,  Vanderwiel,  who  played  in  only  one 
tournament,  has  a  standard  deviation  of  1.03,  whereas  players  like  Kasparov  and  Karpov 
who  played  in  four  tournaments  have  a  lower  standard  deviation.  The  95%  credible  in¬ 
tervals  show  a  wide  range  of  values  for  the  player  parameters,  although  the  correlations 
among  the  7W  are  about  0.5  (not  shown  on  table)  so  that  while  the  individual  standard 
deviations  are  large,  the  standard  deviation  of  the  differences  in  rating  parameters  are 
substantially  reduced.  This  makes  sense  because  the  data  provide  information  about  the 
difference  between  the  7 and  7^,  and  not  about  absolute  levels.  In  fact,  if  the  innova¬ 
tion  variance  for  the  model  were  0,  indicating  the  rating  parameters  do  not  change  over 
time,  the  posterior  correlation  among  the  ratings  could  be  expected  to  be  near  1.  The 
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Player  with 
White 

Number  of 
Months  after 
World  Cup 

Approximate  95% 
Credible  Interval  for 
Probability  of  Winning 

Approximate 
Median  Probability 
of  Kasparov  Winning 

Kasparov 

■HDI 

(0.254, 

0.572) 

0.408 

Kasparov 

6 

(0.218, 

0.609) 

0.404 

Kasparov 

12 

(0.203, 

0.635) 

0.407 

Kasparov 

60 

(0.089, 

0.807) 

0.408 

Short 

1 

(0.166, 

0.434) 

0.286 

Short 

6 

(0.160, 

0.430) 

0.282 

Short 

12 

(0.159, 

0.433) 

0.283 

Short 

60 

(0.124, 

0.431) 

0.276 

Table  5.4:  Winning  probability  estimates  for  Kasparov  against  Short 


credible  intervals  for  the  7^  appear  roughly  symmetric  around  the  means,  with  the  95% 
intervals  spanning  about  4  standard  deviations,  so  that  a  normal  distribution  may  be  a 
reasonable  approximation  to  the  marginal  posterior  distribution  of  7^. 

The  draw  and  color  parameters  have  posterior  means  close  to  their  prior  parameters. 
Both  the  draw  and  color  parameters  have  small  standard  deviations,  as  the  model  assumes 
these  parameters  do  not  vary  over  time.  The  credible  interval  for  the  color  parameter 
spanning  only  negative  values  indicates  a  significant  advantage  to  the  player  with  the 
white  pieces.  For  players  of  equal  rating,  the  value  of  A  =  .95  implies  that  games  will  be 
drawn  56.4%  of  the  time,  not  accounting  for  an  order  effect.  A  value  of  77  =  -.48  implies 
that  a  player  with  the  white  pieces  will  win  about  61.8%  of  the  games  that  result  in  either 
a  win  or  loss.  This  corresponds  to  an  83  point  advantage  for  white  on  the  Elo  rating  scale. 


5.4.3  Forecasting 

To  demonstrate  how  the  model  can  be  used  for  forecasting  future  outcomes,  we  com¬ 
pute  predictive  probabilities  for  games  played  between  Kasparov  and  Short  during  future 
competitions.  Table  5.4  shows  approximate  95%  credible  intervals  along  with  the  median 
estimate  of  the  probability  that  Kasparov  will  win  if  the  two  play  1  month,  6  months,  1 
year  and  5  years  after  the  World  Cup  events.  The  top  four  lines  are  computed  assuming 
Kasparov  is  white  and  the  last  four  assume  Short  is  white.  To  compute  a  given  proba¬ 
bility,  we  draw  5000  values  from  the  distribution  of  (u^Ue),  and  then  for  each  value  of 
u  we  compute  the  approximate  prior  normal  mean  and  variance  of  (7^|I?6)  as  and 
C<6)  4.  fc(7)A-I,  respectively.  Here,  takes  on  values  1,  5,  12,  and  60  depending  on  the 
number  of  months  beyond  the  completion  of  the  World  Cup  that  we  are  interested  in 
forecasting. 

Table  5.4  shows  a  large  amount  of  variability  in  the  credible  intervals  for  predicting  a 
Kasparov  victory.  As  more  time  passes  after  the  World  Cup  events,  the  credible  intervals 
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become  substantially  wider.  This  reflects  the  increasing  uncertainty  in  game  results  due  to 
the  passage  of  time.  The  approximate  credible  interval  when  Kasparov  plays  white  after  60 
months  have  passed  is  exceptionally  large,  suggesting  that  the  model  is  a  poor  forecaster 
for  Kasparov’s  performance  against  Short  as  white.  Curiously,  however,  the  increase  in 
the  credible  inte-  width  when  Short  plays  white  is  not  nearly  as  not  dramatic.  This 
may  be  due  to  t;  easonable  advantage  conveyed  by  playing  white. 


5.5  Model  Adequacy 


In  this  section,  we  diagnose  the  adequacy  of  the  model  fit  in  the  previous  section  using 
the  posterior  distribution  of  carefully  chosen  diagnostic  statistics  (Rubin  1984).  The  idea 
behind  posterior  predictive  model  checks  is  as  follows.  First,  a  model  is  fit  to  observed 
data.  Given  the  posterior  distribution  of  the  parameters,  replications  or  simulated  data 
sets  are  generated  by  drawing  “likely”  sets  of  parameters,  and  then  generating  simulated 
data  conditional  on  these  parameters.  Informative  statistics  are  constructed  to  test  the 
adequacy  of  the  model  against  reasonable  alternatives.  The  distribution  of  the  diagnostic 
statistics  over  the  simulated  data  can  be  used  as  a  reference  distribution  for  the  value  of 
the  diagnostic  computed  on  the  observed  data.  If  the  statistic  for  the  observed  data  is  an 
extreme  value  relative  to  the  distribution  of  the  statistic  over  the  simulated  data,  then 
the  adequacy  of  the  model  is  called  into  question.  We  examine  the  adequacy  of  our  model 
using  posterior  predictive  checks  in  Section  5.5.1.  The  posterior  checks  indicate  problems 
in  either  the  computational  precision  of  the  analysis  or  in  the  model  specification.  Ac¬ 
cordingly,  in  Section  5.5.2  we  simulate  tournament  data  and  perform  posterior  predictive 
checks  to  examine  whether  the  computational  approximations  used  in  the  analysis  are  too 
inaccurate  for  a  moderate  sized  data  set  like  the  World  Cup. 


5.5.1  Posterior  Predictive  Checks  for  the  Model 

Posterior  predictive  checks  were  performed  by  simulating  tournament  data  according  to 
the  posterior  distribution  of  the  parameters.  To  simulate  a  single  sequence  of  six  tourna¬ 
ments,  we  first  draw  from  the  joint  posterior  distribution  of  all  parameters.  This  is  done 
by  carrying  out  the  following  steps. 

1.  Draw  u  from  the  approximate  (discrete)  posterior  distribution  of  (w|Z?6)- 

2.  Draw  (A,tj|u>)  from  their  normal  posterior  distribution.  Section  4.4.1  describes  the 
procedure  for  obtaining  the  approximate  normal  distribution  of  the  parameters  after 
the  last  tournament. 

3.  Conditional  on  u  and  (A,  17),  compute  the  parameters  of  the  approximately  normal 
conditional  likelihood  of  (V^jA,  q,w,dt)  for  each  single  tournament  at  time  t.  Using 
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the  procedure  described  in  Section  4.6.2  for  the  analysis  of  the  Gibbs  sampler,  com 
pute  the  parameters  for  the  distribution  of  (-f^\X,T),u,D6),  and  then  draw  once 
from  this  distribution. 

4.  For  t  =  5, ...,1,  draw  -yM  from  the  approximately  normal  conditional  distribu¬ 
tion  given  the  previous  draws  of  A,q,w.  Thi6  is  accomplished  using 

the  procedure  in  Section  4.6.2  for  drawing  from  the  joint  posterior  distribution  of 
parameters  given  u  in  the  Gibbs  sampler. 


For  each  of  the  six  tournaments,  we  generate  simulated  game  outcomes  according  to 
the  actual  tournament  design,  conditioned  on  the  parameters  drawn  from  the  posterior 
distribution.  The  game  outcomes  are  generated  according  to  the  outcome  probabilities 
of  the  extended  Davidson- Beaver  model  according  to  Section  5.2.  This  process  was  per¬ 
formed  500  times  to  generate  a  sample  of  500  collections  of  6  simulated  World  Cup  tourna¬ 
ments.  For  each  collection  of  6  tournaments,  C,,  i  =  1, . . . ,  500,  we  compute  the  diagnostic 
statistic  T(Ci)  relevant  for  detecting  model  failures. 

We  performed  two  tests  of  adequacy  of  the  model.  The  first  diagnostic  statistic  is  the 
maximum  proportion  of  wins  per  player  out  of  games  that  result  in  either  a  win  or  loss.  An 
unusually  large  or  small  value  would  indicate  that  the  model  does  not  capture  the  player 
to  player  variability.  To  check  whether  a  single  tie  parameter  sufficiently  describes  the 
variation  in  drawn  games  across  players,  the  second  diagnostic  computes  the  maximum 
across  players  of  the  proportion  of  drawn  games.  Figure  5.3  shows  the  distribution  of  each 
of  the  two  diagnostic  statistics  for  the  collection  of  500  simulated  World  Cup  matches  along 
with  a  vertical  line  representing  the  statistic  computed  from  the  actual  data.  The  first 
plot  shows  that  the  maximum  proportion  of  wins  for  any  player  from  the  actual  data  is 
well  within  the  range  of  values  for  the  simulated  tournaments.  This  suggests  that  the 
model  parametrization  sufficiently  captures  the  strength  of  the  players  by  a  single  rating 
parameter  per  player  varying  over  time.  The  second  plot,  on  the  other  hand,  indicates 
a  possible  problem  with  the  model.  The  maximum  proportion  of  draws  across  players 
for  the  actual  data  is  at  the  right  extreme  of  the  distribution  for  simulated  data.  The 
model  does  not  generate  data  sets  with  as  much  variability  in  draws  as  the  observed  data 
indicates.  This  implies  that  a  single  tie  parameter  may  not  be  adequately  describing 
the  frequency  with  which  players  draw  games.  If  a  single  tie  parameter  were  a  sufficient 
description,  the  plot  shows  that  the  player  with  the  greatest  percentage  of  draws  would 
be  expected  to  draw  about  70-75%  of  the  time,  whereas  the  actual  data  yields  a  player 
(Ribli)  who  draws  86%  of  his  games. 

While  the  preceding  analysis  suggests  that  alternative  models  be  explored  for  describ¬ 
ing  chess  game  outcomes  for  the  World  Cup  tournaments,  the  problem  may  stem  from 
difficulties  with  the  computational  approximations.  The  posterior  distribution  of  parame¬ 
ters  is  obtained  by  approximating  each  of  the  likelihoods  by  a  multivariate  normal  density, 
and  as  each  tournament  consists  of  one  comparison  for  each  pair,  the  likelihood  may  not 
be  adequately  approximated  by  a  normal  density.  Furthermore,  in  updating  parameters 
to  include  a  new  tournament,  the  posterior  distribution  of  parameters  is  not  approximated 
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Figure  5.3:  Distribution  of  test  statistics  for  model  adequacy 


at  the  exact  posterior  mode,  but  at  a  parameter  value  that  is  close  to  the  posterior  mode 
only  if  the  likelihood  is  approximately  normal  (see  Section  4.2).  To  illustrate  how  the 
normal  approximation  for  the  World  Cup  data  may  contribute  to  the  inadequacy  of  the 
model,  Table  5.5  displays  the  posterior  means  of  (7W,  A,  q)  in  two  situations.  The  first 
column  of  means  is  computed  using  the  approximately  conjugate  updating  of  Section  4.4 
given  <7  =  0  with  the  prior  distribution  specified  in  Section  5.2,  except  that  the  prior  vari¬ 
ances  are  100  rather  than  10  to  allow  the  data  to  dominate  the  distribution  of  parameters. 
This  column  assumes  that  each  of  the  six  likelihoods  can  be  reasonably  approximated  by 
a  multivariate  normal  density.  The  second  column  of  means  is  computed  by  treating  all 
six  World  Cup  tournaments  as  a  single  large  tournament,  and  finding  the  posterior  means 
by  updating  the  prior  parameters  exactly  once.  Because  <7  =  0,  the  two  columns  will  be 
similar  if  the  normal  approximations  are  satisfactory.  The  parameter  means  in  the  second 
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Posterior  Mean 

Posterior  Mean 

Parameter 

Sequential  Updates 

Single  Update 

Anders  son 

0.19 

Belyavsky 

0.44 

0.46 

Ehlvest 

0.55 

0.62 

Hjart  arson 

-0.68 

-0.75 

Hubner 

0.47 

0.51 

Illescas 

-1.54 

-1.67 

Karpov 

1.90 

2.10 

Kasparov 

2.02 

2.24 

Korchnoi 

-0.36 

-0.44 

Ljubojevic 

0.55 

0.55 

Nikolic 

-0.23 

-0.27 

Noguieras 

-0.77 

-0.81 

Nunn 

0.59 

0.62 

Petursson 

-1.39 

-1.51 

Portisch 

-0.01 

Ribli 

-0.09 

-0.05 

Salov 

1.00 

1.08 

Sax 

-0.18 

-0.17 

Seirawan 

-0.09 

-0.05 

Short 

0.58 

0.64 

Sokolov 

0.21 

0.24 

Spassky 

-0.22 

-0.19 

Speelman 

-0.08 

-0.07 

Tal 

0.28 

0.29 

Timman 

0.39 

0.33 

Vaganian 

-0.09 

Vanderwiel 

0.57 

0.61 

Winants 

-3.82 

-4.08 

Yusupov 

-0.27 

-0.34 

Draw 

0.95 

1.06 

Color 

-0.47 

-0.51 

0 


Table  5.5:  Posterior  means  of  (7^,  A,  77)  with  a  = 
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column  reflect  the  data  more  accurately  because  the  computation  only  relies  on  a  single 
normal  approximation.  While  the  parameter  means  for  individual  parameters  are  close, 
some  are  far  enough  apart  to  be  of  some  concern.  For  example,  the  posterior  mean  for 
A  differs  by  over  0.1,  which,  according  to  Table  5.3,  is  more  than  one  posterior  standard 
deviation.  So  the  analyses  that  produce  Figure  5.3  may  indicate  that  the  computational 
approximations  are  too  coarse. 


5.5.2  Evaluating  the  Effect  of  the  Normal  Approximation 

Model  monitoring  via  predictive  checks  in  Section  5.5.1  raise  doubts  about  whether  the 
proposed  model  for  chess  game  outcomes  of  Section  5.2  is  appropriate.  The  preceding 
section  suggests  that  the  failure  may  occur  because  the  normal  approximations  to  the 
Davidson- Beaver  likelihoods  are  not  accurate  enough.  We  provide  evidence  in  this  section 
that  when  players  compete  a  moderate  number  of  times  in  each  tournament,  the  normal 
approximation  of  the  likelihoods  does  not  disturb  the  predictive  checks. 

The  data  used  for  the  analysis  in  this  section  was  obtained  by  simulating  the  World 
Cup  tournaments  with  the  exact  same  tournament  schedule,  except  players  would  compete 
against  each  other  four  times  (twice  with  each  color)  instead  of  just  once.  This  resulted 
in  a  single  simulated  World  Cup  consisting  of  4  X  789  =  3156  game  outcomes. 

Game  outcomes  are  generated  as  follows.  Strength  parameters,  for  the  first 
tournament  were  generated  from  N(0, 1).  For  successive  tournaments,  -y<*)  was  obtained 
by  adding  Gaussian  noise  with  a  variance  of  .25 k^\  where  is  the  number  of  months 

between  tournaments  t  —  1  and  t,  given  in  Section  5.2.  This  is  equivalent  to  setting  a  —  .5. 
Also,  we  set  A  =  1  and  T)  =  —.5.  From  these  parameters,  we  generated  data  for  the  t-th 
tournament  according  to  the  World  Cup  tournament  schedule  with  four  games  per  pair 
rather  than  just  one. 

Using  the  non-iterative  methodology  of  Section  4.4,  the  posterior  distribution  of  the 
parameters  was  computed  from  the  simulated  data.  Predictive  checks  on  the  data  were 
performed  by  generating  150  sets  of  tournaments  consisting  of  3156  games  each.  Figure  5.4 
displays  the  distribution  of  the  diagnostic  statistics  used  in  Section  5.5.1  for  the  simulated 
data.  As  in  Figure  5.3,  the  vertical  line  on  the  plots  represents  the  initial  simulated 
World  Cup  data,  and  the  histogram  represents  the  150  replicated  sets  of  tournaments.  In 
contrast  to  Figure  5.3,  both  the  observed  statistics  fall  within  the  body  of  the  simulated 
reference  data.  These  two  predictive  checks  indicate  that  it  may  be  appropriate  to  use  the 
normal  approximation  of  the  likelihoods  and  the  approximate  parameter  updates  when 
the  number  of  games  per  player  pair  is  as  few  as  four. 
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Proportion  of  Draws 


Figure  5.4:  Distribution  of  test  statistics  for  simulated  data 

5.6  Examination  of  the  Gibbs  sampler  on  Simulated  Chess 
Outcomes 


Section  5.3.1  has  shown  that  the  Gibbs  sampler  failed  to  converge  in  500  iterations  for  the 
analysis  of  the  World  Cup  chess  data,  possibly  because  the  data  from  single  round-robin 
tournaments  does  not  contain  a  great  deal  of  information  about  the  innovation  variance, 
<r2.  In  this  section,  we  generate  simulated  tournament  data  with  known  innovation  vari¬ 
ance  and  perform  the  Gibbs  sampler  at  dispersed  starting  values  for  ui  to  explore  the  effect 
of  increasing  the  amount  of  data  into  the  analysis. 

The  data  are  generated  as  follows.  Seven  players  are  assigned  values  of  7^  by  making 
random  draws  from  a  standard  normal  distribution.  For  t  =  2, . .  .,6,  we  obtain  7^  by 
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adding  independent  random  draws  from  a  standard  normal  distribution  to  V'-11.  This 
process  results  in  known  values  of  7I4)  for  7  players  over  6  equally  spaced  time  periods. 
Although  the  7^  are  generated  using  e2  —  1,  due  to  sampling  variability  the  mean  squared 
difference  among  the  7^  and  7(<_1)  is  0.548  for  the  sample.  We  assign  q  =  —0.25  and 
A  =  1.0.  We  then  generated  game  outcomes  for  ten  round-robin  tournaments  for  each  t 
according  to  the  model  of  Section  5.2  using  the  ratings  7^  and  the  values  of  q  and  A.  At 
each  f,  each  competitor  played  every  other  ten  times;  five  times  as  white  and  five  times 
as  black. 

The  Gibbs  sampler  with  a  Metropolis  step  was  performed  as  described  in  Section  5.3.1. 
Four  parallel  samplers  were  run  with  starting  values  for  u  of  1/52,  1/22,  1/.52  and  1/.012. 
The  initial  values  of  7W  were  6et  to  0,  and  values  of  7W  for  t  =  2, . .  .,6  were  obtained 
by  adding  Gaussian  noise  with  variance  l/u>  to  7<‘-1).  The  Metropolis  step  accepted  new 
draws  at  a  rate  of  approximately  82%. 

Figure  5.5  shows  the  value  of  u  drawn  at  each  iteration  for  all  four  samplers.  The 
sampler  series  corresponding  to  starting  values  u  =  1/22, 1/.52, 1/.012,  appear  to  move 
towards  the  correct  stationary  distribution  for  this  parameter.  The  sampler  series  starting 
at  u  =  1/52  shows  a  lack  of  convergence  to  the  correct  distribution.  To  examine  whether 
the  difficulty  at  u  —  1/52  is  a  result  of  the  likelihood  of  the  simulated  data  not  containing 
sufficient  information,  the  sampler  was  rerun  with  100  replications  of  the  simulated  data 
(1000  comparisons  per  player  pair).  In  this  case  the  series  moved  quickly  towards  the 
correct  stationary  distribution. 

Several  possibilities  exist  for  explaining  the  problems  associated  with  the  Gibbs  sam¬ 
pler  analysis  on  the  simulated  data.  The  information  contained  in  the  simulated  data, 
which  consists  of  10  game  outcomes  per  player  pair,  may  still  not  be  sufficient  to  guar¬ 
antee  convergence  for  the  dynamic  paired  comparison  model.  With  100  times  as  many 
observations,  the  data  appear  to  have  enough  of  an  effect  to  make  the  Gibbs  sampler 
converge  quickly.  A  possibility  that  may  explain  Figure  5.5  is  that  the  Gibbs  sampler  is 
at  a  minor  mode  in  the  posterior  distribution  of  u.  If  the  distribution  is  reasonably  flat  in 
a  neighborhood  of  this  mode,  the  Gibbs  sampler  may  take  many  iterations  to  jump  out  of 
this  region  of  the  parameter  space  to  find  the  highest  likelihood  regions.  This  phenomenon 
is  discussed  in  Gelman  and  Rubin  (1992b). 


5.7  Discussion 


Analyzing  paired  comparison  data  using  the  techniques  of  Chapter  4  appears  to  be  feasible 
when  the  paired  comparison  model  is  correct  and  a  reasonable  number  of  games  are  played 
between  pairs  of  players.  We  find,  however,  that  when  the  number  of  games  in  the  data 
set  is  small  relative  to  the  number  of  parameters,  the  analysis  of  game  outcomes  using 
the  methodology  of  Chapter  4  may  not  provide  accurate  inferences.  Diagnostic  checks 
of  the  proposed  model  for  the  World  Cup  indicate  that  improvements  are  needed,  either 
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Iteration 


Figure  5.5:  Values  of  log10w  for  Gibbs  sampler  on  simulated  data 

in  computing  the  posterior  distribution  or  in  the  model  specification.  In  this  section,  we 
discuss  approaches  to  improving  the  analysis  of  the  World  Cup  data  set. 

One  feature  of  the  World  Cup  data  set  and  of  other  sports  competitions  is  that  within 
each  tournament  players  compete  against  each  other  at  most  once.  For  a  p-player  system, 
a  single  round-robin  tournament  involves  p(p  —  l)/2  comparisons  and  p  +  2  parameters. 
Portnoy  (1988)  shows  that  for  exponential  families  with  p  parameters  and  n  observations, 
if  p2/n  is  not  small,  the  likelihood  may  not  be  well  approximated  by  a  normal  density.  In 
the  case  of  single  round-robin  tournaments,  this  ratio  can  be  viewed  as  large.  The  updating 
of  parameters  described  in  Section  4.4.1  requires  the  likelihood  to  be  approximately  normal 
in  order  to  justify  the  conjugate  analysis.  Because  the  likelihoods  in  the  World  Cup  data 
set  may  not  be  reasonably  approximated  by  normal  densities,  more  accurate  methods  to 
fit  the  model  seem  necessary. 
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The  computation  can  be  made  more  accurate  for  the  small  sample  analysis  by  com¬ 
puting  the  exact  posterior  mode  before  making  the  normal  approximation  as  well  as 
computing  the  Hessian  matrix  evaluated  at  the  exact  mode  at  each  updating.  In  Sec¬ 
tion  4.4.1  we  just  approximate  the  likelihoods  by  a  normal  distribution  centered  at  the 
maximum  likelihood  estimate,  and  then  combining  this  normal  approximation  with  the 
normal  prior  distribution.  When  the  likelihood  is  normal,  this  gives  the  exact  posterior 
mean;  otherwise,  the  result  is  an  approximation  which  is  coarse  if  the  likelihood  is  not 
well  approximated  by  a  normal  density.  Instead,  the  exact  posterior  mode  can  be  found 
by  performing  the  Newton- Raphson  algorithm  on  the  product  of  the  prior  density  and  the 
exact  likelihood.  Then  the  normal  approximation  can  be  used  directly  on  the  posterior 
distribution,  centered  around  the  posterior  mode.  The  difference  in  computational  cost 
of  using  this  approach  in  place  of  the  approach  of  Section  4.4.1  is  that  for  every  w,  in 
the  discrete  space  {wj , . .  .  ,u;m},  the  Newton- Raphson  algorithm  is  invoked  T  times  (once 
for  each  tournr.  ent)  to  obtain  the  posterior  distribution  of  parameters  after  each  tour¬ 
nament.  Thus,  mT  Newton- Raphson  calculations  are  performed.  The  approach  taken  in 
Section  4.4.1  requires  the  maximum  likelihood  estimates  to  be  found  only  once  for  each 
tournament,  resulting  in  a  total  of  T  Newton- Raphson  analyses.  It  should  be  noted  that 
even  finding  the  correct  posterior  mode  after  each  update  and  approximating  the  likeli¬ 
hoods  by  normal  densities  at  this  true  mode  may  not  be  an  adequate  approximation  if 
the  likelihood  is  substantially  non-normal.  In  such  a  case,  accurate  inferences  may  be 
obtained  using  iterative  simulation  such  as  through  the  G  obs  sampler. 

The  problem  using  the  Gibbs  sampler  is  dearly  the  computational  burden.  For  the 
World  Cup  data,  the  Gibbs  sampler  did  not  converge  within  500  iterations.  The  rate 
of  convergence  has  been  shown  to  be  related  to  the  dependence  among  the  parameter 
distributions  (Liu  1992),  and  with  a  small,  the  dependence  of  the  7W  across  t  is  large  so 
that  the  convergence  can  be  expected  to  be  slow.  Using  the  Gibbs  sampler  for  analyzing 
actual  paired  comparison  data  sets,  therefore,  may  not  be  computationally  feasible. 

Putting  aside  computational  issues,  the  probability  model  for  chess  game  outcomes 
may  be  further  extended.  The  most  likely  misspecification  of  the  model  is  that  a  single 
tie  parameter,  A,  is  not  sufficient  for  describing  players’  tendencies  to  draw  games.  An 
alternative  model  replaces  A  by  (A,  +  Aj)/2  when  players  t  and  j  compete,  where  A,  is 
an  individual  tie  parameter  for  player  i.  This  model  may  more  realistically  reflect  the 
notion  that  some  players  are  more  likely  to  draw  games  than  others.  The  probability 
of  a  draw  is  now  a  function  of  the  average  of  the  draw  parameters  for  the  two  players 
involved  in  a  game,  and  the  larger  the  average,  the  more  likely  the  outcome  of  the  game 
will  be  a  draw.  Glickman  (1991)  proposes  this  parametrization  for  the  draw  parameters, 
substituting  the  average  of  draw  parameters  in  the  context  of  a  model  proposed  by  Rao 
and  Kupper  (1967).  Joe  and  White  (1992)  examine  a  model  for  chess  players  that  replaces 
exp((A,  +  Aj)/2)  in  the  Glickman  model  by  exp(A,)  +  exp(Aj). 

Another  class  of  alternative  models  involve  treating  either  A  or  17  (or  both)  as  varying 
over  time.  As  chess  theory  develops,  the  understanding  of  the  game  may  change  such 
that  the  frequency  of  draws  among  top  players  may  change,  and  the  advantage  conferred 
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to  white  may  also  change.  For  the  World  Cup  data  set,  the  playing  conditions  at  certain 
tournaments  may  have  been  such  that  the  players  choose  to  end  their  games  in  quick  draws. 
Another  possibility  is  that  if  a  player  is  not  feeling  well  at  a  particular  tournament,  he 
may  attempt  to  draw  games  more  often  than  usual.  A  model  with  time  varying  A  and 
17  would  require  an  evolution  variance  parameter  for  the  time- varying  parameters  which 
would  need  to  be  estimated. 


Chapter  6 


Conclusions 


This  thesis  presents  paired  comparison  models  in  which  the  merits  of  the  objects  being 
compared  may  change  over  time.  Chapters  2  and  3  address  paired  comparison  experiments 
with  a  continuous  outcome  variable,  while  chapters  4  and  5  address  paired  comparison 
experiments  with  indicator  outcome  variables.  The  models  developed  in  this  thesis  have 
strong  connections  to  the  Bayesian  dynamic  models  of  West  and  Harrison  (1990).  In  the 
models  we  consider,  every  object  involved  in  a  comparison  has  an  associated  rating  pa¬ 
rameter  which  measures  the  relative  performance  or  ability  of  the  object.  The  observation 
equation  of  the  model  relates  the  outcome  of  a  comparison  to  the  rating  parameter,  and 
the  system  equation  describes  the  evolution  of  the  rating  parameters  over  time.  This 
thesis  demonstrates  methods  for  obtaining  posterior  inferences  for  the  ratings  under  the 
model  and  forecasts  for  future  comparisons. 

Two  approaches  to  the  data  analysis  are  examined.  One  approach  bases  inferences 
on  samples  drawn  from  the  posterior  distribution  of  parameters  using  the  Gibbs  sampler. 
The  other  approach  approximates  the  prior  distribution  of  the  system  variance,  <r3,  by  a 
discrete  distribution  on  a  grid  of  values.  This  approximation  results  in  a  more  tractable 
analysis  when  marginalizing  over  the  posterior  distribution  of  cr2.  This  second  approach 
proves  to  be  the  more  successful  of  the  two  in  practice.  The  analyses  of  NFL  football 
game  outcomes  by  the  Gibbs  sampler  and  by  the  non-iterative  approach  result  in  similar 
inferences,  suggesting  that  the  discrete  approximation  to  the  prior  distribution  of  a1 
was  entirely  satisfactory.  In  the  analysis  of  the  World  Cup  chess  game  outcomes,  the 
Gibbs  sampler  does  not  converge  in  a  computationally  feasible  amount  of  time.  While 
the  computational  approximations  for  the  non-iterative  analysis  of  the  World  Cup  data 
present  some  difficulties,  these  can  be  improved  at  only  a  moderate  computational  cost. 

The  approach  we  present  in  this  paper  provides  a  flexible  means  to  modeling  paired 
comparison  data.  The  framework  is  quite  general  and  extensions  to  these  models,  such 
as  incorporating  non-dynamic  parameters  or  specifying  alternative  tie  parameters,  can  be 
incorporated  without  changing  the  underlying  structure  of  the  basic  models. 
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